Lecture Notes For STAT2602
Lecture Notes For STAT2602
1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Random variable and probability function . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Discrete distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Continuous distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Empirical distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Moment generating function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Point estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Method of moments estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Estimator properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Confidence intervals for means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 One-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2 Tow-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Confidence intervals for variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.1 One-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Two-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Most powerful tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
v
vi Contents
Abstract A random variable usually results from a random experiment, and its ful-
l information is determined by its cumulative distribution function or probability
density function, which however is unknown in practice. By repeating the random
experiment several times (say, n times), we obtain a random sample which consists
of the results of all repeated random experiments. Based on this random sample,
we are able to estimate the cumulative distribution function (by the empirical distri-
bution) and the probability density function (by the relative frequency histogram).
This chapter reviews these basic concepts in statistics. If you are familiar with these
concepts, you can skip this chapter.
A random variable
X : E
X( )
1
2 1 Basic concepts
f (x) = P(X = x)
is the probability density function (p.d.f.) of X. Clearly, f (x) tells us how likely the
event { : X( ) = x} happens.
Remark 1.1. The proof of Theorem 1.1 follows directly from Definition 1.1, and the
details are omitted.
Remark 1.2. Any function f (x) satisfies (1)-(3) of Theorem 1.1 can induce a random
variable, which has a p.d.f. f (x).
Let f (x) = 0 for x E. Then, the domain of f (x) can be extended to R, and
supp( f ) = E. Here, supp( f ) = {x : f (x) = 0} is the support of f (x), i.e., supp( f ) is
the subset of the domain containing those elements which are not mapped to zero.
The graph of the p.d.f. of a discrete random variable X would be a plot of the points
{(x, f (x)) : x E}. However, it is easier to visualize the corresponding probabilities
if a vertical line segment is drawn from each (x, f (x)) to (x, 0), to form a bar graph
(see Fig.1.1 for an illustration).
3/6
2/6
f(x)
1/6
0
1 2 3
x
1
F (x)
1/2
1/6
1 2 3
x
Fig. 1.1 The top panel is the p.d.f F(x) of a discrete random variable X, where f (x) = P(X = x) =
x/6 for x = 1, 2, 3, and the bottom panel is the corresponding c.d.f. F(x).
4 1 Basic concepts
The function F(x) is called the cumulative distribution function (c.d.f.) of the dis-
crete random variable X. Note that F(x) is a step function on R and the height of a
step at x, x E, equals the probability f (x) (see Fig.1.1 for an illustration).
From Theorem 1.1, we can obtain the following theorem.
Remark 1.3. The p.d.f. f (x) and the c.d.f. F(x) are one-to-one corresponding. We
can first define the c.d.f. F(x), and then define the p.d.f. f (x) by
Example 1.4. Let the random variable X of the discrete type have the p.d.f. f (x) =
x/6, x = 1, 2, 3. Then,
0 for x < 1
1
for 1 x < 2
F(x) = 61
for 2 x < 3
2
1 for x 3.
Example 1.5. Let X be a discrete random variable taking value at 1, 2, and 3. Sup-
pose that F(x) is the c.d.f. of X, and it satisfies that
F(1) = 12
F(2) = c
F(3) = c2 56 c + 76 .
The above definitions of p.d.f and c.d.f. can be similarly extended for the discrete
multivariate random variable X = (X1 , , Xs ) R s for s > 1. Let x = (x1 , , xs )
E R s be a realization of X. Then, the p.d.f. of X is
f (x1 , , xs ) = P(X1 = x1 , , Xs = xs ).
fk (xk ) = f (x1 , , xs ).
x1 xk1 xk+1 xs
We call fk (xk ) the marginal p.d.f. of Xk . This marginal p.d.f. fk (xk ) is calculated
by summing f (x1 , , xs ) over all xi s except xk . Further, we say that X1 , , Xs are
independent if and only if
f j,k (x j , xk ) = f (x1 , , xs ).
x1 x j1 x j+1 xk1 xk+1 xs
Extensions of these marginal p.d.f.s to more than two random variables are made in
an obvious way.
Based on the p.d.f. f (x1 , , xs ), the c.d.f. of X = (X1 , , Xs ) R s is defined
by
F(x1 , , xs ) = f (e1 , , es ).
(e1 , ,es )E and e1 x1 , ,es xs
+ f (e1 , e2 ) f (e1 , e2 ).
e1 =x,e2 =y e1 <x,e2 <y
6 1 Basic concepts
Hence,
(b) for any real functions g() and h(), g(X) and h(Y ) are independent.
Proof. (a) Note that
= P(X = x) P(Y = y)
xA yB
as required.
(b) For any and , let A = {x : g(x) = } and B = {y : h(y) = }, and then the
joint p.d.f. of g(X) and h(Y ) is
It follows that g(X) and h(Y ) are independent. This completes the proof.
Example 1.6. Let the joint p.d.f. of X and Y be
xy2
f (x, y) = , x = 1, 2, 3, y = 1, 2.
30
The marginal p.d.f of X is
xy2 x
f1 (x) = 30
= ,
6
x = 1, 2, 3.
y=1,2
xy2 y2
f2 (y) = = , y = 1, 2.
x=1,2,3 30 5
The above distribution, denoted by B(n, p), is called the binomial distribution. Par-
ticularly, B(1, p) is called Bernoulli distribution.
The multinomial distribution is a natural extension of the binomial distribution.
Let the random variable Xi be the number of times Ai occurs in the n trials, i =
1, 2, , k. Then, the multinomial distribution of X1 , , Xk is defined by
( )
n x
f (x1 , , xk ) = P(X1 = x1 , , Xk = xk ) = px1 px2 pkk ,
x1 , x2 , , xk 1 2
8 1 Basic concepts
Example 1.8. A bowl contains three red, four white, two blue, and five green balls.
One ball is drawn at random from the bowl and the replaced. This is repeated 20
independent times. Let X1 , X2 , X3 , and X4 denote the numbers of red, white, blue,
and green balls drawn, respectively. What is the joint p.d.f. of X1 , X2 , X3 , and X4 ?
Solution. For each draw, there are four outcomes (say, red, white, blue, and green)
for the color of the ball, and it is easy to see that
3
p1 = P(the drawn ball is red) = ,
14
4
p2 = P(the drawn ball is white) = ,
14
2
p3 = P(the drawn ball is blue) = ,
14
5
p4 = P(the drawn ball is green) = .
14
Thus, the joint p.d.f. of X1 , X2 , X3 , and X4 is
f (x1 , x2 , x3 , x4 ) = P(X1 = x1 , X2 = x2 , X3 = x3 , X4 = x4 )
( ) ( )x1 ( )x2 ( )x3 ( )x4
20 3 4 2 5
= ,
x1 , x2 , x3 , x4 14 14 14 14
for x1 + x2 + x3 + x4 = 20.
which also satisfies Theorem 1.2. From the fundamental theorems of calculus, we
have F (x) = f (x) if exists. Since there are no steps or jumps in a continuous c.d.f.,
it must be true that P(X = b) = 0 for all real values of b.
As you can see, the definition for the p.d.f. (or c.d.f.) of a continuous random
variable differs from the definition for the p.d.f. (or c.d.f.) of a discrete random
variable by simply changing the summations that appeared in the discrete case to
integrals in the continuous case.
Proof.
P(Y x) = P(F 1 (X) x) = P(X F(x)) = F(x).
Note that this property helps us to generate a random variable from certain distribu-
tion.
where R is the location parameter and > 0 is the scale parameter. Briefly, we
say that X N( , 2 ). A simple illustration of f (x) with different values of and
is given in Fig.1.2.
Further, Z = (X )/ N(0, 1) (the standard normal distribution), and the
c.d.f. of Z is typically denoted by (x), where
x [ ]
1 w2
(x) = P(Z x) = exp dw.
2 2
Property 1.4. If the p.d.f. of a continuous random variable X is fX (x) for x R, the
a ) for x R.
p.d.f. of Y = aX + b for a = 0 is fY (x) = 1a fX ( xb
0.4 0.4
N (0, 1) N (2, 1)
0.35 N (0, 4) 0.35 N (2, 4)
0.3 0.3
0.25 0.25
f(x)
f(x)
0.2 0.2
0.15 0.15
0.1 0.1
0.05 0.05
0 0
-10 -5 0 5 10 -5 2 10
x x
( ) ( )
xb xb
FY (x) = P(Y x) = P(aX + b x) = P X = FX
a a
for x R. Hence,
( ) ( )
1 xb 1 xb
fY (x) = FY (x) = F = fX .
a X a a a
The definitions of p.d.f. and c.d.f. for the continuous univariate random vari-
able can be similarly extended for the continuous multivariate random variable
X = (X1 , , Xs ) R s , s > 1. The probability of X lies in a rectangle A =
(a1 , b1 ] (a2 , b2 ] (as , bs ] is
b1 b2 bs
P(X A) = f (x1 , x2 , , xs )dx1 dx2 dxs
a1 a2 as
for some non-negative function f (x1 , x2 , , xs ). For instance, when s = 2, the prob-
ability P(X A) is the volume of the solid over the region A in the x1 x2 plane and
bounded by the surface z = f (x1 , x2 ). We call f (x1 , x2 , , xs ) the joint p.d.f. of the
continuous random variables X1 , , Xs .
Similar to the discrete case, the marginal p.d.f. of any single Xk is given by the
(s 1)-fold integral:
fk (xk ) = f (x1 , x2 , , xs )dx1 dxk1 dxk+1 dxs
for (x1 , , xs ) R s .
Based on the p.d.f. f (x1 , , xs ), the c.d.f. of a continuous multivariate random
variable X = (X1 , , Xs ) is
x1 xs
F(x1 , , xs ) = P(X1 x1 , , Xs xs ) = f (e1 , , es )de1 des ,
s F(x1 , , xs )
f (x1 , , xs ) = ,
x1 xs
if exists.
Property 1.5. Two continuous random variables X and Y are independent if and only
if
F(x, y) = FX (x)FY (y) for all (x, y) R2 .
Proof. The proof of the only if part is obvious, and the proof of the if part
follows by the fact that
Property 1.6. Let X and Y be two independent continuous random variables. Then,
(a) for arbitrary intervals A and B,
(b) for any real functions g() and h(), g(X) and h(Y ) are independent.
(b) For any and , let A = {x : g(x) } and B = {y : h(y) }, and then the
joint c.d.f. of g(X) and h(Y ) is
It follows that g(X) and h(Y ) are independent by Property 1.5. This completes the
proof.
f (x, y) = 2, 0 x y 1.
(b) By definition,
x y
a + 2 a+1 a + 2 a+1 1
F(x, y) = cua dvdu = yx + xa+2 x + (1 y)a+2 .
1y 1u a+1 a+1 a+1
1 n
Fn (x) = I(Xk x)
n k=1
for x R, where I(A) is an indicator function such that I(A) = 1 if A holds and
I(A) = 0 otherwise. Obviously, Fn (x) assigns the probability 1/n to each Xk , and
we can check that it satisfies Theorem 1.2 (please do it by yourself). Since Fn (x) is
the relative frequency of the event X x, it is an approximation of the probability
P(X x) = F(x). Thus, the following result is expected.
and this convergence can be further extended by the Donskers theorem, where d
stands for the convergence in distribution.
The proof of aforementioned theorem is omitted. This theorem shows that the
empirical distribution function Fn (x) provides an estimate of the c.d.f. F(x). To see
it more clearly, Fig.1.3 plots the empirical distribution function Fn (x) based on a
data sample {X1 , , Xn } with Xi N(0, 1). As a comparison, the c.d.f. (x) of
N(0, 1) is also included in Fig.1.3. From this figure, we can see that Fn (x) is getting
14 1 Basic concepts
close to (x) as the sample size n increases, and this is consistent to the conclusion
in Theorem 1.7.
0.8
Fn(x)
Fn(x)
0.4
0.4
0.0
0.0
1 0 1 2 2 1 0 1 2 3
x x
0.8
Fn(x)
Fn(x)
0.4
0.4
0.0
0.0
2 1 0 1 2 3 3 1 0 1 2 3
x x
Fig. 1.3 The black step function is the empirical distribution function Fn (x) based on a data sample
{X1 , , Xn } with Xi N(0, 1). The red solid line is the c.d.f. (x) of N(0, 1).
Example 1.13. Let X denote the number of observed heads when four coins are
tosses independently and at random. Recall that the distribution of X is B(4, 1/2).
One thousand repetitions of this experiment (actually simulated on the computer)
yielded the following results:
The graph of the empirical distribution function F1000 (x) and the theoretical distri-
bution function F(x) for the binomial distribution are very close (please check it by
yourself).
Example 1.14. The following numbers are a random sample of size 10 from some
distribution:
0.49, 0.90, 0.76, 0.97, 0.73, 0.93, 0.88, 0.75, 0.88, 0.96.
(a) Write done the empirical distribution; (b) use the empirical distribution to esti-
mate P(X 0.5) and P(0.5 X 0.5).
0.97, 0.88, 0.75, 0.73, 0.49, 0.76, 0.88, 0.90, 0.93, 0.96.
Thus, P(X 0.5) = F(0.5) F10 (0.5) = 0.5 and P(0.5 X 0.5) =
F(0.5) F(0.5) F10 (0.5) F10 (0.5) = 0.5 0.4 = 0.1.
The question now is how to estimate the p.d.f. f (x)? The answer is relative
frequency histogram.
For the discrete random variable X, we can estimate f (x) = P(X = x) by the
relative frequency of occurrences of x. That is,
nk=1 I(Xk = x)
f (x) fn (x) = .
n
Example 1.13. (cont) The relative frequency of observing x = 0, 1, 2, 3 or 4 is listed
in the second column, and it is close to the value of f (x), which is the p.d.f of
B(4, 1/2).
16 1 Basic concepts
Roughly speaking, the class intervals are a non-overlapped partition of the interval
[Xmin , Xmax ]. As f (x) = F (x), we expect that when c j1 and c j is close,
F(c j ) F(c j1 )
f (x) for x (c j1 , c j ], j = 1, 2, , l.
c j c j1
Note that
nk=1 I(Xk (c j1 , c j ])
F(c j ) F(c j1 ) = P(X (c j1 , c j ])
n
is the relative frequency of occurrences of Xk (c j1 , c j ]. Thus, we can approximate
f (x) by
nk=1 I(Xk (c j1 , c j ])
f (x) hn (x) = for x (c j1 , c j ], j = 1, 2, , l.
n(c j c j1 )
We call hn (x) the relative frequency histogram. Clearly, the way that we define the
class intervals is not unique, and hence the value of hn (x) is not unique. When the
sample size n is large and the length of the class interval is small, hn (x) is expected
to be a good estimate of f (x).
The property of hn (x) is as follows:
(i) hn (x) 0 for all x;
(ii) The total area bounded by the x axis and below hn (x) equals one, i.e.,
cl
hn (x)dx = 1;
c0
(iii) The probability for an event A, which is composed of a union of class intervals,
can be estimated by the area above A bounded by hn (x), i.e.,
P(A) hn (x)dx.
A
1.5 Expectation 17
Solution. (a) and (b). The frequency and the relative frequency histogram based on
the class intervals are given in the following table:
class interval frequency relative frequency class interval frequency relative frequency
histogram histogram
(2.005, 2.205] 1 0.1 (3.005, 3.205] 5 0.5
(2.205, 2.405] 2 0.2 (3.205, 3.405] 6 0.6
(2.405, 2.605] 2 0.2 (3.405, 3.605] 4 0.4
(2.605, 2.805] 7 0.7 (3.605, 3.805] 4 0.4
(2.805, 3.005] 14 1.4 (3.805, 4.005] 5 0.5
14
f (3) h50 (3) = = 1.4,
50 (3.005 2.805)
5
f (4) h50 (4) = = 0.5.
50 (4.005 3.805)
1.5 Expectation
where the summation is taken over all possible pairs of (x1 , , xn ). If E[u(X1 , , Xn )]
exists, it is called the mathematical expectation (or expected value) of u(X1 , , Xn ).
Remark 1.4. E[u(X1 , , Xn )] exists if
Cov(Xi , X j ) i j
(Xi , X j ) = =
Var(Xi )Var(X j ) i j
Cov(Xi , X j ) = E(Xi X j ) i j .
8 10 14
1 = x1 f1 (x1 ) =
18
+2
18
=
9
x1 =1,2
and
14 2 8 14 10 20
12 = (x1 1 )2 f1 (x1 ) = (1 )
9 18
+ (2 )2 = ;
9 18 81
x1 =1,2
and
29 2 7 29 11 77
22 = (x2 2 )2 f2 (x2 ) = (1 )
18 18
+ (2 )2 =
18 18 324
.
x2 =1,2
Cov(X1 , X2 ) 1/162 1
(X1 , X2 ) = = = = 0.025.
1 2 (20/81) (29/18) 1540
Property 1.7. Let X be a discrete random variable with finite mean E(X), and let a
and b be constants. Then,
(i) E(aX + b) = aE(X) + b;
(ii) if P(X = b) = 1, then E(X) = b;
(iii) if P(a < X b) = 1, then a < E(X) b;
(iv) if g(X) and h(X) have finite mean, then
Proof. By definition,
x
E(X) = x f (x) = f (x) 1 = f (x) I(r x) = f (x)I(r x).
x=1 x=1 r=1 x=1 r=1 x=1 r=1
Example 1.17. A coin shows a head with probability p. Then, how many times do
you expect to toss the coin until it first shows a head?
Solution. Let the required number of tosses until the first head be T . Then, as each
toss is independent, P(T = x) = qx1 p for x 1, where q = 1 p. Hence,
1
E(T ) = xqx1 p = p .
x=1
Another simple way to calculate E(T ) is by noting that P(T > x) = qx for x 0 and
1
E(T ) = P(T > x) = qx = p .
x=0 x=0
Cov(Xi , X j ) i j
(Xi , X j ) = =
Var(Xi )Var(X j ) i j
The first integrand is an odd function, and so the integral over R is zero. The second
integrand is one by some algebra. Hence, E(X) = .
Example 1.19. Let X have density f (x) = ( 1)x for x 1 and > 1. Then,
n
1 1
E(X) = x dx = ( 1) lim dx.
1 x n 1 x 1
When 2 (i.e., 1 1), E(X) = . When > 2, it is not hard to see that
1
E(X) = .
2
22 1 Basic concepts
Note that
b b
x f (x)dx a f (x)dx = a(F(b) F(a)) = a,
a a
b b
x f (x)dx b f (x)dx = b(F(b) F(a)) = b.
a a
Property 1.10. Let X be a non-negative random variable with c.d.f. F, p.d.f f , and
finite expected value E(X). Then, if limn n(F(n) 1) = 0,
E(X) = (1 F(x))dx.
0
E(X 2 ) = E(X 2 I(a |X| b)) + E(X 2 I(|X| < a)) + E(X 2 I(|X| > b))
E(X 2 I(a |X| b))
a2 E(I(a |X| b))
= a2 P(a |X| b) > 0,
which is contradict with the condition that E(X 2 ) = 0. This completes the proof of
(i).
The proofs of remaining parts are left as an excise. (Hint: the proof of (v) and (vi)
relies on the Cauchy-Schwarz inequality below to the random variables X E(X)
and Y E(Y )).
24 1 Basic concepts
Proof. Without loss generality, we assume that E(Y 2 ) > 0. Note that
[ ] [ ]
0 E (XE(Y 2 ) Y E(XY ))2 = E(Y 2 ) E(X 2 )E(Y 2 ) (E(XY ))2 .
Abstract This chapter talks about the moment generating function and some im-
portant convergence, including the the law of large number theory and central limit
theory. The moment generating function is one-to-one corresponding to the cumu-
lative distribution function, and hence it can determine the distribution of a random
variable. In many applications, the expected value of a random experiment is partic-
ularly important, and it is estimated by the sample mean. The law of large number
theory makes sure that the sample mean is a rational estimate of the expected value,
and the variation of this estimate can be measured by the central limit theory.
Let r be a positive integer. The r-th moment about the origin of a random variable X
is defined as r = E(X r ). In order to calculate r , we can make use of the moment
generating function (m.g.f.).
25
26 2 Preliminary
For a continuous random variable X, the proof is similar by using integrals instead
of sums.
(2) Make use of (1).
[ ]
(3) MaX+b (t) = E e(aX+b)t = ebt E(eatX ) = ebt MX (at).
The above property shows that we can decide the distribution of X by calculating
its m.g.f.
Example 2.2. Find the moment generating function of a random variable X follow-
ing a Poisson distribution with mean .
Solution.
e x ( et )x
etx P(X = x) = etx = e = e e e
t
MX (t) = E(etX ) =
x=0 x=0 x! x=0 x!
= e (e 1) .
t
2.1 Moment generating function 27
Example 2.3. Find the moment generating function of a random variable which has
a (probability) density function given by
{
ex , for x > 0;
f (x) =
0, otherwise,
Solution.
MX (t) = E(etX )
+
(t1)x
+ + e 1
= , for t < 1;
= etx f (x)dx = etx ex dx = t 1 1t
0
0
does not exist, for t 1.
Then,
(1) 1
1 = MX (0) = = 1,
(1 t)2 t=0
(2) 2
2 = MX (0) = = 2,
(1 t)3 t=0
(3) 2 3
3 = MX (0) = = 3!.
(1 t)4 t=0
Property 2.3. If X1 , X2 , . . . , Xn are independent random variables, MXi (t) exists for
i = 1, 2, , n, and Y = X1 + X2 + + Xn , then MY (t) exists and
n
MY (t) = MXi (t).
i=1
Example 2.4. Find the probability distribution of the sum of n independent random
variables X1 , X2 , . . . , Xn following Poisson distributions with means 1 , 2 , . . . , n re-
spectively.
i=1 i=1
which is the m.d.f. of a Poisson random variable with mean ni=1 i . Therefore, by
Example 2.2 and Property 2.3, Y follows the Poisson distribution with mean ni=1 i .
28 2 Preliminary
Example 2.5. For positive numbers and , find the moment generating function
of a gamma distribution Gamma( , ) of which the density function is given by
1 x
x e
, for x > 0;
f (x) = ( )
0, otherwise.
Solution.
+
tX x 1 e x
MX (t) = E(e ) = etx dx
0 ( )
+
= x 1 e( t)x dx
0 ( )
+ ( t)
=
x 1 e( t)x dx
( t) 0 ( )
, for t < ;
= ( t)
does not exist, for t ,
where +
( t) 1 ( t)x
x e dx = 1
0 ( )
is due to the fact that
( t) 1 ( t)x
x e for x > 0
( )
Example 2.6. Find the distribution of the sum of n independent random variables
X1 , X2 , . . . , Xn where Xi follows Gamma(i , ), i = 1, 2, . . . , n, with the p.d.f. given
by 1 x
ix i e
, for x > 0;
f (x) = (i )
0, otherwise.
Solution. From the previous example, we know that the moment generating function
of Xi is ( )i
MXi (t) = for t < , i = 1, 2, . . . , n.
t
Hence, the moment generating function of X1 + X2 + + Xn is
( )1 +2 ++n
n
Xi M (t) =
t
for t < .
i=1
2.2 Convergence 29
2.2 Convergence
1 n
X= Xi ,
n i=1
1 n
S2 = (Xi X)2 .
n i=1
P(|Zn Z| > ) 0 as n .
For brevity, this is often written as Zn p Z.
Theorem 2.1. (Weak law of large numbers) Let (Xn ; n 1) be a sequence of inde-
pendent random variables having the same finite mean and variance, = E(X1 )
and 2 = Var(X1 ). Then, as n ,
X p .
= n 2 .
Property 2.4. (Chebyshovs inequality) Suppose that E(X 2 ) < . Then, for any con-
stant a > 0,
E(X 2 )
P(|X| a) .
a2
Proof. This is left as an excise.
Property 2.5. If Xn p and Yn p , then (i) Xn +Yn p + ; (ii) XnYn p ;
(iii) Xn /Yn / if Yn = 0 and = 0; (iv) g(Xn ) p g( ) for a continuous function
g().
Proof. The proof is omitted.
Example 2.8. Let (Xn ; n 1) be a sequence of independent random variables having
the same finite mean = E(X1 ), finite variance 2 = Var(X1 ), and finite fourth
moment 4 = E(X14 ). Show that
S2 p Var(X1 ).
To prove the above central limit theorem, we need the following lemma:
Lemma 2.1. If
1. MZn (t), the moment generating function of Zn , exists, n = 1, 2, . . .,
2. lim MZn (t) exists and equals the moment generating function of a random vari-
n
able Z,
then
lim GZn (x) = GZ (x) for all x at which GZ (x) is continuous,
n
where GZn (x) is the distribution function of Zn , n = 1, 2, . . ., and GZ (x) is the distri-
bution function of Z.
Proof. As (Xn ; n 1) is a sequence of independent random variables having the
same finite mean = E(X1 ) and finite variance 2 = Var(X1 ), simple algebra gives
us that
32 2 Preliminary
1 n 2
E(X) =
n i=1
E(Xi ) = and Var(X) = E[(X)2 ] [E(X)]2 =
n
.
(1) t 2 (2)
MYi (t) = MYi (0) + tMYi (0) + M ( ), for some 0 t.
2 Yi
Since Zn = 1 ni=1 Yi , then the moment generating function of Zn is thus given by
n
n ( )
t
MZn (t) = MYi
i=1 n
[ ( )]n
t
= MYi
n
[ ]n
t (1) (t/ n)2 (2)
= MYi (0) + MYi (0) + MYi ( )
n 2
[ ]n
t t 2 (2)
= 1 + E(Yi ) + MYi ( )
n 2n
[ 2 ]n
t (2)
= 1 + MYi ( ) ,
2n
(2) (2)
where 0 t/ n. As n , 0 and MYi ( ) MYi (0) = E(Yi2 ) = 1. Hence,
( )n ( 2) ( )
t2 t 1
lim MZn (t) = lim 1 + = exp = exp 0 t + 1 t 2
n n 2n 2 2
which is the moment generating function of N(0, 1) random variable. Hence, the
conclusion follows directly from Lemma 2.1.
Chapter 3
Point estimator
Example 3.1. Suppose that X follows a Bernoulli distribution so that the p.d.f. of X
is
f (x; p) = px (1 p)1x , x = 0, 1,
where the unknown parameter p with = {p : p (0, 1)}. Further, assume
that we have a random sample X = {X1 , X2 , , Xn } with the observable values x =
{x1 , x2 , , xn }, respectively. Then, the probability that X = x is
L(x1 , , xn ; p) = P(X1 = x1 , X2 = x2 , , Xn = xn )
n
= pxi (1 p)1xi = pi=1 xi (1 p)ni=1 xi ,
n n
i=1
which is the joint p.d.f. of X1 , X2 , , Xn evaluated at the observed values. The joint
p.d.f. is a function of p. Then, we want to find the value of p that maximizes this
33
34 3 Point estimator
The way to propose p is reasonable because p most likely has produced the sample
values x1 , , xn . We call p the maximum likelihood estimate, since likelihood
is often used as a synonym for probability in informal contexts.
Conventionally, we denote L(p) = L(x1 , , xn ; p), and p is easier to be com-
puted by
p = arg max log L(p).
p
[Note that p maximizes log L(p) also maximizes L(p)]. By simple algebra (see one
example below), we can show that
1 n
p = xi ,
n i=1
1 n
p = Xi .
n i=1
Note that
( )
d(p) 1 n 1 n
= Xi n Xi
dp p i=1 1 p i=1
n n
(1 p) Xi np + p Xi
i=1 i=1
=
p(1 p)
n(X p)
= .
p(1 p)
Solution. Note that a uniformly distribution over the interval [0, ] has the p.d.f.
given by
1 , for 0 x , 1
f (x; ) = = I(0 x ).
0, otherwise,
For the random sample X, the likelihood function is
n
1 1 n
L( ) = I(0 Xi ) = n I(0 Xi ).
i=1 i=1
0 Xi , i = 1, 2, . . . , n.
1
Since increases as decreases, we must select to be as small as possible sub-
n
ject to the previous constraint. Therefore, the maximum of L( ) should be selected
to be the maximum of X1 , X2 , . . . , Xn , that is, the maximum likelihood estimator
= X(n) = max1in Xi .
n n (Xi 1 )2
( ) = log L( ) = log(22 ) i=1 .
2 22
( ) 1 n
0=
1
= (Xi 1 ),
2 i=1
( ) n 1 n
0= = + 2 (Xi 1 )2 .
2 22 22 i=1
1 n 1 n
1 = X =
n i=1
Xi and 2 = S2 = (Xi X)2 .
n i=1
By considering the usual condition on the second partial derivatives, these solutions
do provide a maximum. Thus, the MLEs of 1 and 2 are
1 = X and 2 = S2 ,
respectively.
or
= 1 and 2 = 2 12 .
That is, the unknown parameters and 2 can be estimated if we find good
estimators for 1 and 2 . Note that by the weak low of large numbers (see Theorem
2.1),
1 n 1 n
m1 =
n i=1
Xi p E(X) = 1 and m2 = Xi2 p E(X 2 ) = 2 .
n i=1
= h(1 , 2 , , k ), (3.1)
h1 (1 , 2 ) = 1 and h2 (1 , 2 ) = m2 m21 .
1 n r
mr = Xi ,
n i=1
r = 1, 2, . . . .
Unlike r , mr always exists for any positive integer r. In view of (3.1), the method
of moments estimator (MME) of is defined by
= h(m1 , m2 , , mk ),
Find a MME of ( , ).
38 3 Point estimator
Solution. Some simple algebra shows that the first two moments are
2 +
1 = and 2 = .
2
[Note: 1 and 2 can be obtained from Example 2.6.] Substituting = 1 in the
second equation, we get
( 1 )2 + 1 1 1
2 = = 12 + or = ,
2 2 12
It is worth noting that the way to construct h in (3.1) is not unique. Usually, we
use the lowest possible order moments to construct f , although this is may not be
the optimal way. To consider the optimal MME, one may refer to the generalized
method of moments estimator for a further reading.
For the same unknown parameter , many different estimators may be obtained.
Heuristically, some estimators are good and others bad. The question is how would
we establish a criterion of goodness to compare one estimator with another? The
particular properties of estimators that we will discuss below are unbiasedness, effi-
ciency, and consistency.
3.3.1 Unbiasedness
Bias( ) = E( ) .
Example 3.3. (cont) (i) Show that = X(n) is an asymptotically unbiased estimator
of ; (ii) modify this estimator of to make it unbiased.
1 n 1 n
E(X) =
n i=1
E(Xi ) = 1 = 1 .
n i=1
1 n [ ] n1
E(S2 ) = E (Xi X)2 = n 2 .
n i=1
3.3.2 Efficiency
Suppose that we have two unbiased estimators and . The question is how to
compare and in terms of a certain criterion. To answer this question, we first
introduce the so-called mean squared error of a given estimator .
Definition 3.6. (Mean squared error) Suppose that is an estimator of . The mean
squared error of is [( )2 ]
MSE( ) = E .
For a given estimator , MSE( ) is the mean (expected) value of the square of
the error (difference) . This criterion can be decomposed by two parts as shown
below.
Property 3.1. If Var( ) exists, then the mean squared error of is
[ ]2
MSE( ) = Var( ) + Bias( ) .
Proof.
[( )2 ]
MSE( ) = E
({[ ] [ ]}2 )
=E E( ) + E( )
([ ]2 [ ][ ] [ ]2 )
= E E( ) + 2 E( ) E( ) + E( )
3.3 Estimator properties 41
[ ][ ] [ ]2
= Var( ) + 2E E( ) E( ) + Bias( )
[ ]2
= Var( ) + Bias( ) .
Remark 3.1. The following result is straightforward:
MSE( ) = Var( )
by Property 3.1. Now, for two unbiased estimators and , we only need to se-
lect the one with a smaller variance, and this motivates us to define the efficiency
between and .
Definition 3.7. (Efficiency) Suppose that and are two unbiased estimators of .
The efficiency of relative to is defined by
Var( )
Eff ( , ) = .
Var( )
(1)
samples, one of size n1 and one of size n2 , and denote the sample means as X and
(2)
X , respectively. Then,
(2)
(1) (2) Var(X ) n1
Eff (X ,X )= (1)
= .
Var(X ) n2
Therefore, the larger is the sample size, the more efficient is the sample mean for
estimating .
n+1
Example 3.3. (cont) Note that Y is an unbiased estimator of , where Y is the
n
n-th order statistic. Show that
(i) 2X is also an unbiased estimator of ;
(ii) Compare the efficiency of these two estimators of .
Solution. (i) Since E(X) equals the population mean, which is /2, E(2X) = .
Thus, 2X is an unbiased estimator of .
42 3 Point estimator
(ii) First we must find the variance of the two estimators. Before, we have already
obtained ( )n
y
P(Y y) = for 0 y .
Therefore, for Z = Y 2 , it is not hard to show that
( )n
z
P(Z z) = P(Y z) = for 0 z 2 .
2 2
Var(2X) = 4Var(X) = 4 = .
12n 3n
Therefore,
3.3 Estimator properties 43
( )
n+1 Var(2X) n+2
Eff Y, 2X = ( )= .
n Var n+1
n Y 3
(i) Var( ) has a lower bound for all unbiased estimators; (3.2)
(ii) the variance of achieves this lower bound. (3.3)
It is worth noting that conditions (3.2)-(3.3) are not necessary for the UMVUE, since
there are some cases that the UMVUE can not achieve the lower bound in (3.2).
To consider the lower bound of Var( ), we need introduce the so-called Fisher
information.
1
Var( ) ,
In ( )
i=1
[ ( ) ]
n
ln f (Xi ; ) 2 n n
ln f (X ; ) ln f (X ; )
=E + E
i j
Hence, it gives us that 1 Var( ) In ( ), which implies that the Cramer-Rao in-
equality holds.
1
Corollary 3.1. If is an unbiased estimator of and Var( ) = , then is a
In ( )
UMVUE of .
or [ ] ( )
2 ln f (X; ) 1 1
I( ) = E = E = 2.
2 2
Hence,
1 1 2
CRLB = = = .
In ( ) nI( ) n
2
Recall that E(X) = and Var(X) = n . Thus, X is a UMVUE of .
Solution. For x = 0 or 1,
f (x; ) = x (1 )1x ,
ln f (x; )
= [x ln + (1 x) ln(1 )]
x 1x
=
1
x 1
= .
(1 ) 1
Noting that [ ]
X 1
E = ,
(1 ) 1
we have
[( )2 ] [ ]
ln f (X; ) X (1 ) 1
I( ) = E = Var = 2 = .
(1 ) (1 )2 (1 )
Hence,
1 1 (1 )
CRLB = = = .
In ( ) nI( ) n
3.3 Estimator properties 47
(1 )
Since E(X) = and Var(X) = n , X is a UMVUE of .
If we know the full information about the population distribution X, the following
theorem tells us that the MLE tends to be the first choice asymptotically.
d N(0, 1).
1/In ( )
1
If is an unbiased estimator of , the above theorem implies that Var( )
In ( )
when n is large. That is, the MLE can achieve the CRLB asymptotically.
3.3.3 Consistency
In the previous discussions, we have restricted our attention to the unbiased estima-
tor, and proposed a way to check whether an unbiased estimator is UMVUE. Now,
we introduce another property of the estimator called the consistency.
p ,
We shall mention that the unbiasedness along does not imply the consistency. A
toy example is as follows. Suppose that
3.2, X is a consistent estimator of (This is just the weak law of large numbers in
Theorem 2.1).
For S2 , we have
1 n 1 n 1 n
S2 =
n i=1
(Xi X)2 = (Xi2 + (X)2 2Xi X) = Xi2 (X)2 .
n i=1 n i=1
1 n 2
Xi p 2 = E(X12 ).
n i=1
(X)2 p 2 .
49
50 4 Interval Estimation
( )
X
= P 2 2
1/4
= P (2 Z 2)
0.9544,
where Z N(0, 1) and we have used the fact that X N( , 1/4). Thus, we have
over a 95% change of covering the unknown parameter with our interval estimator.
Note that for any point estimator of , we have P( = ) = 0. Sacrificing some
precision in the interval estimator, in moving from a point to an interval, has resulted
in increased confidence that our assertion about is correct.
The certainty of the confidence (or guarantee) is quantified in the following defi-
nition.
1 = inf P ,
Remark 4.1. In many situations, the coverage probability P is free of , and hence
1 = P( [L(X),U(X)]).
where we can show that T has the p.d.f. fT (x) = nxn1 for 0 x 1. Hence, it
follows that
( ) 1/a ( )n ( )n
1 1 1 1
P T = nxn1 dx = .
b a 1/b a b
4.2 Confidence intervals for means 51
( 1 )n ( 1 )n
That is, the coverage probability P of the first interval is free of , and a b
is the confidence coefficient of this interval.
For the second interval, we have
P = P ( [Y + c,Y + d]) = P ( d Y c)
( )
d c
= P 1 T 1
1c/
= nxn1 dx
1d/
( ) ( )
c n d n
= 1 1 .
Hence, we find that the coverage probability P of the second interval is not free of
. Moreover, since ( ) ( )
c n d n
lim 1 1 = 0,
we know that inf P = 0 (i.e., the confidence coefficient of this interval is 0).
Now, the question is how to construct the interval estimator. One important way
to do it is using the pivotal quantity.
Definition 4.4. (Pivotal Quantity) A random variable Q(X, ) = Q(X1 , , Xn , ) is
a pivotal quantity if the distribution of Q(X, ) is free of . That is, regardless of
the distribution of X, Q(X, ) has the same distribution for all values of .
Logically, when Q(X, ) is a pivotal quantity, we can easily construct a 1
confidence interval for Q(X, ) by
( )
1 = P L e Q(X, ) U
e , (4.1)
X N( , 2 /n). (4.2)
X
Hence, when 2 is known, Z = N(0, 1) is a pivotal quantity involving .
/ n
Let
( )
1 = P z /2 Z z /2
( )
X
= P z /2 z /2
/ n
( )
= P X z /2 X + z /2 ,
n n
where z satisfies
P(Z z ) =
for Z N(0, 1). Usually, we call z the upper percentile of N(0, 1) at the level ;
see Fig. 4.1. So, when 2 is known, a 1 confidence interval of is
[ ]
X z /2 , X + z /2 . (4.3)
n n
Given the observed value of X = x and the value of z /2 , we can calculate the
area is 1
area is /2 area is /2
z/2 z/2
interval estimate of by
4.2 Confidence intervals for means 53
[ ]
x z /2 , x + z /2 .
n n
As the point estimator, the 1 confidence interval is also not unique. Ideally, we
should choose it as narrow as possible in some sense, but in practice, we usually
choose the equal-tail confidence interval as in (4.3) for convenience, since tables for
selecting equal probabilities in the two tails are readily available.
Example 4.3. A publishing company has just published a new college textbook. Be-
fore the company decides the price of the book, it wants to know the average price
of all such textbooks in the market. The research department at the company took
a sample of 36 such textbooks and collected information on their prices. This in-
formation produced a mean price of $48.40 for this sample. It is known that the
standard deviation of the prices of all such textbooks is $4.50. Construct a 90% con-
fidence interval for the mean price of all such college textbooks assuming that the
underlying population is normal.
Solution. From the given information, n = 36, x = 48.40 and = 4.50. Now, 1 =
0.9, i.e., = 0.1, and by (4.3), the 90% confidence interval for the mean price of all
such college textbooks is given by
4.50 4.50
[x z /2 , x + z /2 ] = [48.40 z0.05 , 48.40 + z0.05 ]
n n 36 36
[47.1662, 49.6338].
Example 4.4. Suppose the bureau of the census and statistics of a city wants to esti-
mate the mean family annual income for all families in the city. It is known that
the standard deviation for the family annual income is 60 thousand dollars. How
large a sample should the bureau select so that it can assert with probability 0.99
that the sample mean will differ from by no more than 5 thousand dollars?
Thus, the sample size should be at least 956. (Note that we have to round 955.5517
up to the next higher integer. This is always the case when determining the sample
size.)
54 4 Interval Estimation
nS2
Note that W is n2 and Z 2 is 12 by Property 4.2(i). Since and Z 2 are independent
2
nS2
by (i), we can show that the m.g.f. of 2 is the same as the one of n1
2 . Hence, (ii)
holds.
(iii) Note that
(X )/( / n)
T= .
nS2 / 2 1/(n 1)
Hence, by Property 4.2(ii), T is tn1 .
From Property 4.1(iii), we know that T is a pivotal quantity of . Let
4.2 Confidence intervals for means 55
( )
1 = P t /2,d f =n1 T t /2,d f =n1
( )
X
= P t /2,d f =n1 t /2,d f =n1
S/ n 1
( )
S S
= P X t /2,d f =n1 X + t /2,d f =n1 ,
n1 n1
where t ,d f =k satisfies
P(T t ,d f =k ) =
for a random variable T tk ; see Fig. 4.2. So, when 2 is unknown, a 1 confi-
dence interval of is
[ ]
S S
X t /2,d f =n1 , X + t /2,d f =n1 . (4.4)
n1 n1
Given the observed value of X = x, S = s, and the value of t /2,d f =n1 , we can
area is 1
area is /2 area is /2
t/2,df=n t/2,df=n
function of tn tends to that of N(0, 1); see Fig. 4.3. That is, in tests or exams, if
n is so large that the value of t ,d f =n cannot be found, you may use z instead.
t3
t10
t20
N (0, 1)
Example 4.5. A paint manufacturer wants to determine the average drying time of
a new brand of interior wall paint. If for 12 test areas of equal size he obtained a
mean drying time of 66.3 minutes and a standard deviation of 8.4 minutes, construct
a 95% confidence interval for the true population mean assuming normality.
s 3.0
t /2,d f =n1 2.010 = 0.8614.
n 50 1
4.2 Confidence intervals for means 57
Thus, the 95% confidence interval is [14.75 0.86, 14.75 + 0.86], or [13.89, 15.59].
Besides the confidence interval for the mean of one single normal distribution, we
shall also consider the problem of constructing confidence intervals for the differ-
ence of the means of two normal distributions when the variances are unknown.
Let X = {X1 , X2 , , Xn } and Y = {Y1 ,Y2 , ,Ym } be random samples from in-
dependent distributions N(X , X2 ) and N(Y , Y2 ), respectively. We are of interest
to construct the confidence interval for X Y when X2 = Y2 = 2 .
First, we can show that
(X Y ) (X Y )
Z=
2 /n + 2 /m
is N(0, 1). Also, by the independence of X and Y, from Property 4.1(ii), we know
that
nS2 mS2
U = 2X + 2Y
is n+m2
2 . Moreover, by Property 4.1(i), V and U are independent. Hence,
Z
T=
U/(n + m 2)
[(X Y ) (X Y )]/ 2 /n + 2 /m
=
(nSX2 + mSY2 )/[ 2 (n + m 2)]
(X Y ) (X Y )
=
R
is tn+m2 , where ( )
nSX2 + mSY2 1 1
R= + .
n+m2 n m
That is, T is a pivotal quantity of X Y . Let
( )
1 = P t /2,d f =n+m2 T t /2,d f =n+m2
( )
(X Y ) (X Y )
= P t /2,d f =n+m2 t /2,d f =n+m2
R
( )
= P (X Y ) t /2,d f =n+m2 R X Y (X Y ) + t /2,d f =n+m2 R ,
58 4 Interval Estimation
where ( )
ns2X + msY2 1 1
r= + .
n+m2 n m
Example 4.7. Suppose that scores on a standardized test in mathematics taken by
students from large and small high schools are N(X , 2 ) and N(Y , 2 ), respec-
tively, where 2 is unknown. If a random sample of n = 9 students from large high
schools yielded x = 81.31, s2X = 60.76 and a random sample of m = 15 students
from small high schools yielded y = 78.61, sY2 = 48.24, the endpoints for a 95%
confidence interval for X Y are given by
( )
9 60.76 + 15 48.24 1 1
81.31 78.61 2.074 +
22 9 15
because P(T 2.074) = 0.975. That is, the 95% confidence interval is [3.95, 9.35].
nS2
First, we consider the one-sample case. By Property 4.1(ii), n1
2 is a pivotal
2
quantity involving 2 . Let
( )
nS2
1 = P 12
/2,d f =n1 2
/2,d f =n1
2
( )
nS2 nS2
=P 2
2
,
2 /2,d f =n1 1 /2,d f =n1
where 2 ,d f =n satisfies
P(T 2 ,d f =n ) =
for a random variable T n2 ; see Fig. 4.4. So, a 1 confidence interval of 2 is
4.3 Confidence intervals for variances 59
[ ]
nS2 nS2
, 2 . (4.6)
2 /2,d f =n1 1 /2,d f =n1
Given the observed value of S = s and the values of 2 /2,d f =n1 and 1
2
/2,d f =n1 ,
we can calculate the interval estimate of by
2
[ ]
ns2 ns2
, 2 .
2 /2,d f =n1 1 /2,d f =n1
area is 1
area is
/2
area is /2
21/2,df=n 2/2,df=n
Example 4.8. A machine is set up to fill packages of cookies. A recently taken ran-
dom sample of the weights of 25 packages from the production line gave a variance
of 2.9 g2 . Construct a 95% confidence interval for the standard deviation of the
weight of a randomly selected package from the production line.
the 95% confidence interval for the population variance is (1.8420, 5.8468). Tak-
ing positive square roots, we obtain the 95% confidence interval for the population
standard deviation to be (1.3572, 2.4180).
Next, we consider the two-sample case. Let X = {X1 , X2 , , Xn } and Y = {Y1 ,Y2 , ,Ym }
be random samples from independent distributions N(X , X2 ) and N(Y , Y2 ), re-
spectively. We are of interest to construct the confidence interval for X2 /Y2 .
Property 4.3. Suppose that U r21 and V r22 are independent. Then,
U/r1
Fr1 ,r2 =
V /r2
By Property 4.1(ii),
nSX2 mSY2
n1
2
and m1
2
.
X2 Y2
P(T F ,d f =(m,n) ) =
for a random variable T Fm,n ; see Fig. 4.5. So, a 1 confidence interval of
X2 /Y2 is
[ ]
n(m 1)SX2 n(m 1)SX2
F1 /2,d f =(m1,n1) , F /2,d f =(m1,n1) . (4.7)
m(n 1)SY2 m(n 1)SY2
4.3 Confidence intervals for variances 61
area is 1
area is
/2
area is /2
H0 : 0 versus H1 : 1 ,
63
64 5 Hypothesis testing
H0 : = 80 versus H1 : = 80.
If both the two hypotheses are simple, the null hypothesis H0 is usually chosen
to be a kind of default hypothesis, which one tends to believe unless given strong
evidence otherwise.
Example 5.2. Suppose that the score of STAT2602 follows N( , 10), and we want
to know whether the theoretical mean = 80 or 70. In this case,
H0 : = 80 versus H1 : = 70.
Usually, the test statistic is a functional on the random sample X = {X1 , , Xn }, and
it is denoted by W (X). Some important terms about the test statistic are as follows:
Rejection region or critical region the set of values of the test statistic for which
the null hypothesis is rejected;
Acceptance region the set of values of the test statistic for which the null hypoth-
esis is not rejected (is accepted);
Type I error rejection of the null hypothesis when it is true;
Type II error acceptance of the null hypothesis when it is false.
Accept H0 Reject H0
H0 is true No error Type I error
H0 is false Type II error No error
{W (X) R}.
( ) = P (W (X) R) for 0 ;
( ) = P (W (X) Rc ) for 1 .
(p) = P p (W 14)
= 1 P p (W > 14)
20 ( )
20 k
= 1 p (1 p)20k
k=15 k
{
0.0113, for p = 0.9;
0.8744, for p = 0.6.
(This implies that the probability of committing a type I and type II error are 0.0113
and 0.1256, respectively.)
Example 5.4. Let X1 , , Xn be a random sample from N( , 2 ), where 2 is
0
known. Consider a test statistic T = X
/ n
for hypotheses H0 : 0 versus H1 :
> 0 . Assume that the rejection region is {T K}. Then, the power function is
( ) = P (T K)
( )
X
= P K+ 0
/ n / n
( )
0
= P Z K+ ,
/ n
The ideal power function is 0 for 0 and 1 for 1 . However, this ideal
can not be attained in general. For a fixed sample size, it is usually impossible to
make both types of error probability arbitrarily small. In searching for a good test,
it is common to restrict consideration to test that control the type I error probability
at a specified level. Within this class of tests we then search for tests that have type
II error probability that is as small as possible.
The size defined below is used to control the type I error probability.
Definition 5.2. (Size) For [0, 1], a test with power function ( ) is a size test
if
sup ( ) = .
0
Example 5.5. Suppose that we want to test the null hypothesis that the mean of a
normal population with 2 = 1 is 0 against the alternative hypothesis that it is 1 ,
where 1 > 0 .
(a) Find the value of K such that {X K} provides a rejection region with the level
of significance = 0.05 for a random sample of size n.
(b) For the rejection region found in (a), if 0 = 10, 1 = 11 and we need the type
II probability 0.06, what should n be?
= (0 ) = P0 (X K)
( )
X 0 K 0
= P 0
/ n / n
( )
K 0
=P Z ,
1/ n
which is equivalent to
K 0 1.645
= z0.05 1.645 or K 0 + .
1/ n n
(b) Note that H0 : = 10, H1 : = 11, and the rejection region is {X K}. By
definition,
5.1 Basic concepts 67
= (1 ) = P1 (X < K)
( )
X 1 K 1
= P 1 <
/ n / n
( )
K 1
=P Z<
/ n
( )
0 + 1.645
1
n
P Z< .
/ n
Hence,
0.06 n + 1.645 z0.06 1.555
n (1.645 + 1.555)2 10.24,
Remark 5.2. In the above example, the value of K in the rejection region {X K}
is determined by the significance level . For the test statistic X, the value of K
uniquely decides whether the null hypothesis is rejected or not, and it is usually
called a critical value of this test.
Definition 5.3. (p-value) Let W (x) be the observed value of the test statistic W (X).
Case 1: The rejection region is {W (X) K}, then
p-value W (x) K
the observed value of W (X) falls in the rejection region
H0 is rejected at the significance level .
Definition 5.4. (Most powerful tests) A test concerning a simple null hypothesis
= 0 against a simple alternative hypothesis = 1 is said to be most powerful if
the power of the test at = 1 is a maximum.
L( ) = f (X1 , X2 , . . . , Xn ; ),
P {(X1 , X2 , . . . , Xn ) C} = for = 0 ,
and
f (x1 , x2 , . . . , xn ; 0 )
k when (x1 , x2 , . . . , xn ) C,
f (x1 , x2 , . . . , xn ; 1 )
f (x1 , x2 , . . . , xn ; 0 )
k when (x1 , x2 , . . . , xn ) C,
f (x1 , x2 , . . . , xn ; 1 )
Construct a test, called the likelihood ratio test, which rejects H0 : = 0 and
accepts H1 : = 1 if and only if (X1 , X2 , . . . , Xn ) C. Then any other test which
has significance level has power not more than that of this likelihood ratio
test. In other words, the likelihood ratio test is most powerful among all tests having
significance level .
Proof. Suppose D is the rejection region of any other test which has significance
level . We consider first the continuous case. Note that
= P {(X1 , X2 , . . . , Xn ) C} for = 0
= f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn ;
C
70 5 Hypothesis testing
= P {(X1 , X2 , . . . , Xn ) D} for = 0
= f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn .
D
Subtracting
f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn ,
CD
we get
f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn
CD
f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn ,
C D
or
P {(X1 , X2 , . . . , Xn ) C} P {(X1 , X2 , . . . , Xn ) D}
for = 1 . The last inequality states that the power of the likelihood ratio test at
= 1 is at least as much as that corresponding to the rejection region D. The proof
for discrete case is similar, with sums taking places of integrals.
L(0 )
k (X1 , X2 , , Xn ) C W (X) R,
L(1 )
where the interval R is chosen so that the test has the significance level . Generally
speaking, the likelihood ratio helps us to determine the test statistic and the form of
its rejection region.
L(0 )
The likelihood ratio test rejects the null hypothesis = 0 if and only if k,
L(1 )
that is,
{ }
1 n [ ]
exp (Xi 1 ) (Xi 0 ) k
202 i=1
2 2
n
(21 Xi + 12 + 20 Xi 02 ) 202 ln k
i=1
n
n(12 02 ) + 2(0 1 ) Xi 202 ln k
i=1
20 ln k n(1 0 )
2 2 2
X (since 1 > 0 ).
2n(0 1 )
Therefore, the most powerful test having significance level is the one which
has the rejection region
{ } { }
0 z X 0
X 0 + or z .
n 0 / n
(Note that the rejection region found does not depend on the value of 1 ).
If 0 Xi 1 for i = 1, 2, . . . , n, find the form of the most powerful test with signifi-
cance level for testing
H0 : = 2 versus H1 : = 1.
n
The likelihood ratio test rejects H0 if and only if 2n Xi K where K is a positive
i=1
n
constant (or, equivalently, Xi k where k is a positive constant).
i=1
H0 : 0 versus H1 : 1 ,
where 1 is the complement of 0 with respect to (i.e., 1 = /0 ). Let
L(0 )
= .
L( )
Example 5.8. Find the generalized likelihood ratio test for testing
H0 : = 0 versus H1 : = 0
From aforemention example and the similar technique, we can have the following
table:
Test H0 H1
Rejection region p-value
)
{ } (
X 0 x 0
Two-tailed = 0
= 0 z /2 P |Z|
{ / n } ( / n)
X 0 x 0
Left-tailed = 0 or 0 < 0 z P Z
{ / n } ( / n)
X 0 x 0
Right-tailed = 0 or 0 > 0 z P Z
/ n / n
Example 5.9. The standard deviation of the annual incomes of government employ-
ees is $1400. The mean is claimed to be $35,000. Now a sample of 49 employees
has been drawn and their average income is $35,600. At the 5% significance level,
can you conclude that the mean annual income of all government employees is not
$35,000?
Solution 1.
Step 1: The mean ... is not 35,000 can be written as = 35000, while the
mean ... is 35,000 can be written as = 35000. Since the null hypothesis should
include an equality, we consider hypothesis:
X 0 X 35000 X 35000
Z= = = ,
/ n 1400/ 49 200
Solution 2.
Step 1:
H0 : = 35000 versus H1 : = 35000.
Step 2: The test statistic is
X 0 X 35000 X 35000
Z= = = ,
/ n 1400/ 49 200
Example 5.10. The chief financial officer in FedEx believes that including a stamped
self-addressed envelope in the monthly invoice sent to customers will reduce the
amount of time it takes for customers to pay their monthly bills. Currently, cus-
tomers return their payments in 24 days on average, with a standard deviation of 6
days. It was calculated that an improvement of two days on average would cover the
costs of the envelopes (because cheques can be deposited earlier). A random sample
76 5 Hypothesis testing
of 220 customers was selected and stamped self-addressed envelopes were included
in their invoice packs. The amounts of time taken for these customers to pay their
bills were recorded and their mean is 21.63 days. Assume that the corresponding
population standard deviation is still 6 days. Can the chief financial officer conclude
that the plan will be profitable at the 10% significance level?
Solution 1. The plan will be profitable when < 22, and not profitable when
22. Since the null hypothesis should include an equality, we have
Since 0.9147 > 1.282 z0.1 , H0 should not be rejected. The chief financial
officer cannot conclude that the plan is profitable at the 10% significance level.
Solution 2: Consider
where Z follows N(0, 1). Therefore, H0 should not be rejected. The chief financial
officer cannot conclude that the plan is profitable at the 10% significance level.
Example 5.11. Find the generalized likelihood ratio test for testing
H0 : = 0 versus H1 : > 0
Solution. Now
= {( , ) : 0 , > 0} ,
0 = {( , ) : = 0 , > 0} ,
1 = {( , ) : > 0 , > 0} .
and hence,
ln L( , ) n
= 2 (X ),
ln L( , ) n 1 n
= + 3 (Xi )2 .
i=1
1 n
2 = (Xi 0 )2 .
n i=1
This is because is the maximum value of ln L(0 , ), by noting that for all > 0,
1 n ln L( , )
< (Xi )2 > 0,
n i=1
1 n ln L( , )
>
n i=1
(Xi )2
< 0.
Therefore, ( )n
1 ( n)
L(0 ) = L(0 , ) = exp .
2 2
On , the maximum value of L( , ) is L( , ), where (noting that L( , )
decreases with respect to when > X and increases with respect to when
< X) {
0 , if X 0 ;
=
X, if X > 0 ,
and
1 n
2 = (Xi )2 .
n i=1
Therefore, ( )n
1 ( n)
L( ) = L( , ) = exp .
2 2
Thus, we have
1, if X 0 ;
n n/2
( )n ( 2 )n/2
L(0 ) (Xi X)
2
= = = =
L( ) 2
i=1 , if X > 0 .
n
(Xi 0 )2
i=1
78 5 Hypothesis testing
that is,
(X 0 )2 n(X 0 )2
= n k2/n 1,
S2
(Xi X) 2
i=1
or (since X > 0 )
X 0
c,
S/ n 1
where c is the constant (n 1)(k2/n 1). In order that the level of significance
is , that is, ( )
X 0
P( , ) c = for = 0 ,
S/ n 1
we should let c = t ,n1 , since
( )
X 0
P( , ) t ,n1 = P (tn1 t ,n1 ) for = 0
S/ n 1
by Property 4.1(iii). So, the generalized likelihood ratio test has the rejection region
{ }
X 0
t ,n1
S/ n 1
and
1 n
2 = (Xi )2 .
n i=1
Therefore, ( )n
1 ( n)
L(0 ) = L( , ) = exp .
2 2
On , the maximum value of L( , ) is L( , ), where
1 n
= X and 2 = (Xi X)2
n i=1
Therefore, ( )n
1 ( n)
L( ) = L( , ) = exp .
2
2
Thus, we have
1, if X 0 ;
n n/2
( )n ( 2 )n/2
=
L(0 )
=
=
=
(Xi X)2
L( ) 2
ni=1
, if X > 0 .
(Xi 0 )2
i=1
Hence the generalized likelihood ratio remains the same as that in the previous ex-
ample, and so does the rejection region.
From aforemention two examples and the similar technique, we can have the
following table:
Test H0 H1 Rejection
region } p-value
)
{ (
X 0 x 0
Two-tailed = 0 = 0 t /2,n1
P |tn1 |
{ S/ n 1 } ( s/ n 1)
X 0 x 0
Left-tailed = 0 or 0 < 0 t ,n1 P tn1
{S/ n 1 } ( s/ n 1 )
X 0 x 0
Right-tailed = 0 or 0 > 0 t ,n1 P tn1
S/ n 1 s/ n 1
Example 5.13. According to the last census in a city, the mean family annual in-
come was 316 thousand dollars. A random sample of 900 families taken this year
80 5 Hypothesis testing
produced a mean family annual income of 313 thousand dollars and a standard de-
viation of 70 thousand dollars. At the 2.5% significance level, can we conclude that
the mean family annual income has declined since the last census?
Since 1.286 > 1.963 = t0.025,899 , we do not reject H0 . Thus we cannot conclude
that the mean family annual income has declined since the last census at the 2.5%
level of significance.
Example 5.14. Given a random sample of size n from a normal population with
unknown mean and variance, find the generalized likelihood ratio test for testing the
null hypothesis = 0 (0 > 0) against the alternative hypothesis = 0 .
L(0 ) = L( , 0 )
( )n [ ]
1 1 n
= exp 2 (Xi X)2 .
0 2 20 i=1
1 n
= X and 2 = (Xi X)2
n i=1
5.3 Generalized likelihood ratio tests: One-sample case 81
Therefore, ( )n
1 ( n)
L( ) = L( , ) = exp .
2 2
Thus, we have
n
L(0 )
(
2
)n/2 (Xi X)2
n
= = exp i=1 +
L( ) 02 202 2
n n/2 n
i i
(X X)2 (X X)2
n
i=1
= i=1 exp + .
n02 202 2
dg(y)
= ey yey = (1 y)ey .
dy
Since
dg(y)
y < 1 >0
dy
and
dg(y)
y > 1 < 0,
dy
g(y) will be small when y is close to zero or very large. Thus we reject the null
hypothesis = 0 when the value of Y (or nY ) is large or small, that is, the rejection
region of our generalized likelihood ratio test has the rejection region:
{nY K1 } {nY K2 }.
82 5 Hypothesis testing
nS2
Note that nY = . In order that the level of significance is , that is,
02
( ) ( 2 )
nS2 nS
P( , ) K1 + P K2 = for = 0 ,
02 02
we should let K1 = 1
2
/2,n1 and K2 = /2,n1 , since
2
( ) ( )
nS2
P( , ) K1 = P n1
2
1
2
/2,n1 =
02 2
and ( ) ( )
nS2
P( , ) K2 = P n1
2
2 /2,n1 =
02 2
for = 0 by using the fact that nY n1
2 from Property 4.1(ii).
From the aforemention example and the similar technique, we can have the fol-
lowing table:
Test H0 H1
Rejection region p-value
{ } ( ( )
nS2 ns2
Two-tailed = 0 = 0 1 /2,n1 2 min P n1 2 ,
2 2
02
{ } ( )0
nS2 ns2 )
/2,n1
2
P n1 2
2
{ 20 0 )
2
} (
nS ns2
Left-tailed = 0 or 0 < 0 2
1 ,n1 P 2
n1
{0 2 02 )
2
} (
nS ns2
Right-tailed = 0 or 0 > 0 2 ,n1 P n12
2
0 2 0
Example 5.15. One important factor in inventory control is the variance of the daily
demand for the product. A manager has developed the optimal order quantity and
reorder point, assuming that the variance is equal to 250. Recently, the company
has experienced some inventory problems, which induced the operations manager
to doubt the assumption. To examine the problem, the manager took a sample of 25
daily demands and found that s2 = 270.58. Do these data provide sufficient evidence
at the 5% significance level to infer that the management scientists assumption
about the variance is wrong?
ns2 25 270.58
= 25.976.
02 250
Since 10.05/2,251
2 12.401 25.976 39.364 0.05/2,251
2 , we do not reject
H0 . There is not sufficient evidence at the 5% significance level to infer that the
management scientists assumption about the variance is wrong.
We can obtain the interval estimation by using the two-tailed hypothesis testing. For
example, consider hypotheses
H0 : = 0 versus = 0 .
In this section, we assume that there are two populations following N(1 , 12 ) and
N(2 , 22 ) respectively. A sample {Xi , i = 1, 2, . . . , n1 } is taken from the popula-
tion N(1 , 12 ) and a sample {Y j , j = 1, 2, . . . , n2 } is taken from the population
N(2 , 22 ). Assume that these two samples are independent (that is, X1 , X2 , . . . , Xn1 ,
Y1 ,Y2 , . . . ,Yn2 are independent).
We first consider the hypothesis testing for 1 2 when 1 and 2 are known.
84 5 Hypothesis testing
Example 5.16. Assume that 1 and 2 are known. Find the generalized likelihood
ratio for testing
H0 : 1 2 = versus H1 : 1 2 = .
0 = {(1 , 2 ) : 1 2 = } ,
1 = {(1 , 2 ) : 1 2 = } ,
= 0 1 = {(1 , 2 ) : < 1 < , < 2 < } .
On 0 , we have
ln L(1 , 2 ) = ln L(1 , 1 )
1 n1 1 n2
= C
21 i=1
2
(Xi 1 )2 2
22
(Y j 1 + )2 ,
j=1
1 n1 1 n2
ln L(1 , 1 ) = 2 (Xi 1 ) + 2 (Y j 1 + )
1 1 i=1 2 j=1
n1 (X 1 ) n2 (Y 1 + )
= +
12 22
( )
n1 X n2 (Y + ) n1 n2
= 2 + + 1 .
1 22 12 22
n1 X n2 (Y + )
+
12 22
1 = n1 n2 ,
+
12 22
since
5.4 Generalized likelihood ratio tests: Two-sample case 85
1 < 1 ln L(1 , 1 ) > 0,
1
1 > 1 ln L(1 , 1 ) < 0.
1
and 2 = Y maximizes
( )n2 [ ]
n2
1 1
2 2
exp 2
22
(Y j 2 )
2
.
j=1
Therefore,
$$\lambda = \frac{L(\hat{\omega}_0)}{L(\hat{\Omega})}
= \exp\left\{-\frac{1}{2\sigma_1^2}\left[\sum_{i=1}^{n_1}(X_i - \hat{\mu}_1)^2 - \sum_{i=1}^{n_1}(X_i - \bar{X})^2\right]
       -\frac{1}{2\sigma_2^2}\left[\sum_{j=1}^{n_2}(Y_j - \hat{\mu}_1 + \delta)^2 - \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2\right]\right\}$$
$$= \exp\left[-\frac{n_1(\bar{X} - \hat{\mu}_1)^2}{2\sigma_1^2} - \frac{n_2(\bar{Y} - \hat{\mu}_1 + \delta)^2}{2\sigma_2^2}\right]
 = \exp\left[-C(\bar{X} - \bar{Y} - \delta)^2\right],$$
where $C = \dfrac{1}{2\left(\sigma_1^2/n_1 + \sigma_2^2/n_2\right)} > 0$. Therefore the rejection region should be $\left\{|\bar{X} - \bar{Y} - \delta| \ge K\right\}$.
Under $H_0$, we have
$$\bar{X} \sim N\!\left(\mu_1, \frac{\sigma_1^2}{n_1}\right)
\qquad \text{and} \qquad
\bar{Y} \sim N\!\left(\mu_1 - \delta, \frac{\sigma_2^2}{n_2}\right),$$
so that $\bar{X} - \bar{Y} \sim N\!\left(\delta,\ \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}\right)$, and the test statistic
$$Z = \frac{\bar{X} - \bar{Y} - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
follows $N(0, 1)$ under $H_0$.
From the aforementioned example and a similar technique, we have the following tests, where $z_\alpha$ denotes the upper $\alpha$ point of $N(0,1)$ and $z$ denotes the observed value of $Z$:

- Two-tailed: $H_0: \mu_1 - \mu_2 = \delta$ versus $H_1: \mu_1 - \mu_2 \ne \delta$; rejection region $\{|Z| \ge z_{\alpha/2}\}$; p-value $P(|N(0,1)| \ge |z|)$.
- Left-tailed: $H_0: \mu_1 - \mu_2 = \delta$ or $\mu_1 - \mu_2 \ge \delta$ versus $H_1: \mu_1 - \mu_2 < \delta$; rejection region $\{Z \le -z_{\alpha}\}$; p-value $P(N(0,1) \le z)$.
- Right-tailed: $H_0: \mu_1 - \mu_2 = \delta$ or $\mu_1 - \mu_2 \le \delta$ versus $H_1: \mu_1 - \mu_2 > \delta$; rejection region $\{Z \ge z_{\alpha}\}$; p-value $P(N(0,1) \ge z)$.
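As an illustration only (the data values below are made up, not from the notes), the two-sample test with known standard deviations can be carried out as follows.

```python
# Two-sample z-test for mu1 - mu2 = delta with known sigma1, sigma2.
# A minimal sketch with illustrative (made-up) numbers.
import numpy as np
from scipy.stats import norm

x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])    # sample from N(mu1, sigma1^2)
y = np.array([9.6, 9.9, 9.5, 9.8, 9.7])              # sample from N(mu2, sigma2^2)
sigma1, sigma2, delta, alpha = 0.3, 0.3, 0.0, 0.05    # variances assumed known

z = (x.mean() - y.mean() - delta) / np.sqrt(sigma1**2 / len(x) + sigma2**2 / len(y))
p_two_sided = 2 * norm.sf(abs(z))

print(f"z = {z:.3f}, p-value = {p_two_sided:.4f}, reject H0: {p_two_sided < alpha}")
```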
Next, we consider the hypothesis testing for $\mu_1 - \mu_2$ when $\sigma_1$ and $\sigma_2$ are unknown but equal.
Example 5.17. Assume that $\sigma_1$ and $\sigma_2$ are unknown but equal to $\sigma$. Find the generalized likelihood ratio for testing
$$H_0: \mu_1 - \mu_2 = \delta \quad \text{versus} \quad H_1: \mu_1 - \mu_2 \ne \delta.$$
Solution. Here
$$\omega_0 = \{(\mu_1, \mu_2, \sigma) : \mu_1 - \mu_2 = \delta,\ \sigma > 0\}, \qquad
\omega_1 = \{(\mu_1, \mu_2, \sigma) : \mu_1 - \mu_2 \ne \delta,\ \sigma > 0\},$$
$$\Omega = \omega_0 \cup \omega_1 = \{(\mu_1, \mu_2, \sigma) : \sigma > 0\}.$$
On $\omega_0$, we have
$$\ln L(\mu_1, \mu_2, \sigma) = \ln L(\mu_1, \mu_1 - \delta, \sigma)
= C - (n_1 + n_2)\ln\sigma - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n_1}(X_i - \mu_1)^2 + \sum_{j=1}^{n_2}(Y_j - \mu_1 + \delta)^2\right],$$
$$\frac{\partial}{\partial\mu_1}\ln L(\mu_1, \mu_1 - \delta, \sigma)
= \frac{n_1(\bar{X} - \mu_1) + n_2(\bar{Y} - \mu_1 + \delta)}{\sigma^2}
= \frac{n_1\bar{X} + n_2(\bar{Y} + \delta)}{\sigma^2} - \frac{n_1 + n_2}{\sigma^2}\mu_1.$$
This implies that the maximum likelihood estimator of $\mu_1$ is
$$\hat{\mu}_1 = \frac{n_1\bar{X} + n_2(\bar{Y} + \delta)}{n_1 + n_2},$$
which does not depend on $\sigma$, since
$$\mu_1 < \hat{\mu}_1 \;\Longrightarrow\; \frac{\partial}{\partial\mu_1}\ln L(\mu_1, \mu_1 - \delta, \sigma) > 0
\qquad \text{and} \qquad
\mu_1 > \hat{\mu}_1 \;\Longrightarrow\; \frac{\partial}{\partial\mu_1}\ln L(\mu_1, \mu_1 - \delta, \sigma) < 0.$$
Moreover, the maximum likelihood estimator of $\sigma$ on $\omega_0$ is $\hat{\sigma}_\omega$, where
$$\hat{\sigma}_\omega^2 = \frac{1}{n_1 + n_2}\left[\sum_{i=1}^{n_1}(X_i - \hat{\mu}_1)^2 + \sum_{j=1}^{n_2}(Y_j - \hat{\mu}_1 + \delta)^2\right],$$
since
$$\sigma < \hat{\sigma}_\omega \;\Longrightarrow\; \frac{\partial}{\partial\sigma}\ln L(\hat{\mu}_1, \hat{\mu}_1 - \delta, \sigma) > 0
\qquad \text{and} \qquad
\sigma > \hat{\sigma}_\omega \;\Longrightarrow\; \frac{\partial}{\partial\sigma}\ln L(\hat{\mu}_1, \hat{\mu}_1 - \delta, \sigma) < 0.$$
Therefore,
$$\ln L(\hat{\omega}_0) = C - (n_1 + n_2)\ln\hat{\sigma}_\omega - \frac{n_1 + n_2}{2}.$$
On $\Omega$, we have
$$\ln L(\mu_1, \mu_2, \sigma)
= C - (n_1 + n_2)\ln\sigma - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n_1}(X_i - \mu_1)^2 + \sum_{j=1}^{n_2}(Y_j - \mu_2)^2\right],$$
$$\frac{\partial}{\partial\mu_1}\ln L(\mu_1, \mu_2, \sigma) = \frac{n_1(\bar{X} - \mu_1)}{\sigma^2},
\qquad
\frac{\partial}{\partial\mu_2}\ln L(\mu_1, \mu_2, \sigma) = \frac{n_2(\bar{Y} - \mu_2)}{\sigma^2},$$
$$\frac{\partial}{\partial\sigma}\ln L(\mu_1, \mu_2, \sigma)
= -\frac{n_1 + n_2}{\sigma} + \frac{1}{\sigma^3}\left[\sum_{i=1}^{n_1}(X_i - \mu_1)^2 + \sum_{j=1}^{n_2}(Y_j - \mu_2)^2\right].$$
Hence, by following the same routine as before, we can show that the maximum likelihood estimators are
$$\hat{\mu}_1 = \bar{X}, \qquad \hat{\mu}_2 = \bar{Y}, \qquad
\hat{\sigma}^2 = \frac{1}{n_1 + n_2}\left[\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2\right].$$
Therefore,
$$\ln L(\hat{\Omega}) = C - (n_1 + n_2)\ln\hat{\sigma} - \frac{n_1 + n_2}{2}.$$
Note that
$$\frac{\hat{\sigma}_\omega^2}{\hat{\sigma}^2}
= \frac{\displaystyle\sum_{i=1}^{n_1}(X_i - \hat{\mu}_1)^2 + \sum_{j=1}^{n_2}(Y_j - \hat{\mu}_1 + \delta)^2}
       {\displaystyle\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}
= \frac{\displaystyle\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + n_1(\bar{X} - \hat{\mu}_1)^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2 + n_2(\bar{Y} - \hat{\mu}_1 + \delta)^2}
       {\displaystyle\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}$$
$$= 1 + \frac{n_1(\bar{X} - \hat{\mu}_1)^2 + n_2(\bar{Y} - \hat{\mu}_1 + \delta)^2}
             {\displaystyle\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}
 = 1 + \frac{n_1\left[\dfrac{n_2(\bar{X} - \bar{Y} - \delta)}{n_1 + n_2}\right]^2 + n_2\left[\dfrac{n_1(\bar{Y} + \delta - \bar{X})}{n_1 + n_2}\right]^2}
             {\displaystyle\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}$$
$$= 1 + \frac{\dfrac{n_1 n_2}{n_1 + n_2}(\bar{X} - \bar{Y} - \delta)^2}{n_1 S_1^2 + n_2 S_2^2}
 = 1 + \frac{(\bar{X} - \bar{Y} - \delta)^2}{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\left[n_1 S_1^2 + n_2 S_2^2\right]},$$
where $S_1^2$ and $S_2^2$ are the sample variances of $\{X_i,\ i = 1, 2, \ldots, n_1\}$ and $\{Y_j,\ j = 1, 2, \ldots, n_2\}$ respectively. Since $\lambda = \left(\hat{\sigma}^2/\hat{\sigma}_\omega^2\right)^{(n_1+n_2)/2}$, $H_0$ should be rejected when
$$\frac{|\bar{X} - \bar{Y} - \delta|}{\sqrt{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\left[n_1 S_1^2 + n_2 S_2^2\right]}}$$
is large.
Under $H_0$, $\bar{X} \sim N\!\left(\mu_1, \dfrac{\sigma^2}{n_1}\right)$ and $\bar{Y} \sim N\!\left(\mu_1 - \delta, \dfrac{\sigma^2}{n_2}\right)$, and thus
$$\bar{X} - \bar{Y} \sim N\!\left(\delta,\ \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right),$$
which implies that
$$\frac{\bar{X} - \bar{Y} - \delta}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim N(0, 1).$$
Besides, the fact that the two independent random variables $\dfrac{n_1 S_1^2}{\sigma^2}$ and $\dfrac{n_2 S_2^2}{\sigma^2}$ follow $\chi^2_{n_1-1}$ and $\chi^2_{n_2-1}$ respectively implies that $\dfrac{n_1 S_1^2 + n_2 S_2^2}{\sigma^2}$ follows $\chi^2_{n_1+n_2-2}$. Therefore,
$$W = \frac{\dfrac{\bar{X} - \bar{Y} - \delta}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}}
          {\sqrt{\dfrac{n_1 S_1^2 + n_2 S_2^2}{\sigma^2(n_1 + n_2 - 2)}}}
 = \frac{\bar{X} - \bar{Y} - \delta}{\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\,\sqrt{\dfrac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}}}$$
follows $t_{n_1+n_2-2}$ under $H_0$.
From the aforementioned example and a similar technique, we have the following tests, where $S_p^2 = \dfrac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}$ and $\bar{x}$, $\bar{y}$, $s_p$ denote observed values:

- Two-tailed: $H_0: \mu_1 - \mu_2 = \delta$ versus $H_1: \mu_1 - \mu_2 \ne \delta$; rejection region $\left\{\dfrac{|\bar{X} - \bar{Y} - \delta|}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \ge t_{\alpha/2,\,n_1+n_2-2}\right\}$; p-value $P\!\left(|t_{n_1+n_2-2}| \ge \dfrac{|\bar{x} - \bar{y} - \delta|}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right)$.
- Left-tailed: $H_0: \mu_1 - \mu_2 = \delta$ or $\mu_1 - \mu_2 \ge \delta$ versus $H_1: \mu_1 - \mu_2 < \delta$; rejection region $\left\{\dfrac{\bar{X} - \bar{Y} - \delta}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \le -t_{\alpha,\,n_1+n_2-2}\right\}$; p-value $P\!\left(t_{n_1+n_2-2} \le \dfrac{\bar{x} - \bar{y} - \delta}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right)$.
- Right-tailed: $H_0: \mu_1 - \mu_2 = \delta$ or $\mu_1 - \mu_2 \le \delta$ versus $H_1: \mu_1 - \mu_2 > \delta$; rejection region $\left\{\dfrac{\bar{X} - \bar{Y} - \delta}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \ge t_{\alpha,\,n_1+n_2-2}\right\}$; p-value $P\!\left(t_{n_1+n_2-2} \ge \dfrac{\bar{x} - \bar{y} - \delta}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right)$.
Remark 5.3. $S_p^2 = \dfrac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}$ is called the pooled sample variance, and it is an unbiased estimator of $\sigma^2$.
Example 5.18. A consumer agency wanted to estimate the difference in the mean amounts of caffeine in two brands of coffee. The agency took a sample of 15 500-gramme jars of Brand I coffee that showed the mean amount of caffeine in these jars to be 80 mg per jar and the standard deviation to be 5 mg. Another sample of 12 500-gramme jars of Brand II coffee gave a mean amount of caffeine equal to 77 mg per jar and a standard deviation of 6 mg. Assuming that the two populations are normally distributed with equal variances, check at the 5% significance level whether the mean amount of caffeine in 500-gramme jars is greater for Brand I than for Brand II.
Solution. Let Brand I be population 1 and Brand II be population 2. We test
$$H_0: \mu_1 \le \mu_2 \quad \text{versus} \quad H_1: \mu_1 > \mu_2,$$
a right-tailed test with $\delta = 0$. With $n_1 = 15$, $\bar{x} = 80$, $s_1 = 5$, $n_2 = 12$, $\bar{y} = 77$, $s_2 = 6$, the pooled sample variance is
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{14 \times 25 + 11 \times 36}{25} = 29.84,$$
so the value of the test statistic is
$$\frac{\bar{x} - \bar{y}}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{80 - 77}{\sqrt{29.84}\sqrt{\dfrac{1}{15} + \dfrac{1}{12}}} \approx 1.42.$$
As $1.42 < 1.708 \approx t_{0.05,\,25}$, we cannot reject $H_0$. Thus, we conclude that the mean amount of caffeine in 500-gramme jars is not greater for Brand I than for Brand II at the 5% significance level.
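The calculation in Example 5.18 can be reproduced with the following sketch (not part of the notes), using scipy.stats.t for the critical value and p-value.

```python
# Pooled two-sample t-test from summary statistics (Example 5.18),
# right-tailed: H0: mu1 <= mu2 versus H1: mu1 > mu2.
import math
from scipy.stats import t

n1, xbar, s1 = 15, 80.0, 5.0     # Brand I
n2, ybar, s2 = 12, 77.0, 6.0     # Brand II
alpha = 0.05

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
stat = (xbar - ybar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

critical = t.ppf(1 - alpha, df)   # t_{0.05, 25}, roughly 1.708
p_value = t.sf(stat, df)
print(f"t = {stat:.3f}, critical = {critical:.3f}, p-value = {p_value:.3f}")
# t is about 1.418 < 1.708, so H0 is not rejected at the 5% level.
```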
Example 5.19. Find the generalized likelihood ratio test for the hypotheses
$$H_0: \sigma_1 = \sigma_2 \quad \text{versus} \quad H_1: \sigma_1 \ne \sigma_2.$$
Solution. It can be proved (details omitted) that the generalized likelihood ratio is
$$\lambda = C\,\frac{\left(\dfrac{S_1^2}{S_2^2}\right)^{n_1/2}}{\left[n_1\dfrac{S_1^2}{S_2^2} + n_2\right]^{(n_1+n_2)/2}},$$
where $C$ is a constant. For $w > 0$ define the function
$$G(w) = \frac{w^{n_1/2}}{[n_1 w + n_2]^{(n_1+n_2)/2}}.$$
Then,
$$\ln G(w) = \frac{n_1}{2}\ln w - \frac{n_1 + n_2}{2}\ln[n_1 w + n_2],$$
$$\frac{d}{dw}\ln G(w) = \frac{n_1}{2w} - \frac{n_1 + n_2}{2}\cdot\frac{n_1}{n_1 w + n_2}
= \frac{n_1 n_2 (1 - w)}{2w[n_1 w + n_2]},$$
which is negative when $w > 1$ and positive when $w < 1$. Therefore, the value of $G(w)$ will be small when $w$ is very large or very small, and so $H_0$ should be rejected when $\dfrac{S_1^2}{S_2^2}$ is large or small.
When $H_0$ is true,
$$\frac{n_1(n_2 - 1)S_1^2}{n_2(n_1 - 1)S_2^2}
= \frac{\dfrac{n_1 S_1^2}{(n_1 - 1)\sigma_1^2}}{\dfrac{n_2 S_2^2}{(n_2 - 1)\sigma_2^2}}
\quad \text{follows } F_{n_1-1,\,n_2-1}$$
by Property 4.3. Thus, we let the test statistic be $W = \dfrac{n_1(n_2 - 1)S_1^2}{n_2(n_1 - 1)S_2^2}$, and the rejection region is
$$\left\{W \le F_{1-\alpha/2,\,n_1-1,\,n_2-1}\right\} \cup \left\{W \ge F_{\alpha/2,\,n_1-1,\,n_2-1}\right\}.$$
Remark 5.4. Recall that $F_{\alpha,m,n}$ is the positive real number such that $P(X \ge F_{\alpha,m,n}) = \alpha$, where $X$ follows $F_{m,n}$. Suppose $X$ follows $F_{m,n}$. Then $1/X$ follows $F_{n,m}$ and
$$F_{1-\alpha,\,m,\,n} = \frac{1}{F_{\alpha,\,n,\,m}},$$
because
$$1 - \alpha = P\!\left(X \ge F_{1-\alpha,m,n}\right) = P\!\left(\frac{1}{X} \le \frac{1}{F_{1-\alpha,m,n}}\right)
\;\Longrightarrow\; P\!\left(\frac{1}{X} \ge \frac{1}{F_{1-\alpha,m,n}}\right) = \alpha
\;\Longrightarrow\; \frac{1}{F_{1-\alpha,m,n}} = F_{\alpha,n,m}
\;\Longrightarrow\; F_{1-\alpha,m,n} = \frac{1}{F_{\alpha,n,m}}.$$
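A quick numerical check of this identity (not from the notes); note that scipy's ppf takes a lower-tail probability, so the upper-$\alpha$ point $F_{\alpha,m,n}$ corresponds to f.ppf(1 - alpha, m, n).

```python
# Numerical check of F_{1-alpha, m, n} = 1 / F_{alpha, n, m}.
from scipy.stats import f

alpha, m, n = 0.05, 9, 15
lhs = f.ppf(alpha, m, n)             # F_{1-alpha, m, n}: lower alpha point of F(m, n)
rhs = 1.0 / f.ppf(1 - alpha, n, m)   # 1 / F_{alpha, n, m}
print(lhs, rhs)                      # the two numbers agree
```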
From the aforementioned example and a similar technique, we have the following tests, with $W = \dfrac{n_1(n_2-1)S_1^2}{n_2(n_1-1)S_2^2}$ and $w$ its observed value:

- Two-tailed: $H_0: \sigma_1 = \sigma_2$ versus $H_1: \sigma_1 \ne \sigma_2$; rejection region $\left\{W \le F_{1-\alpha/2,\,n_1-1,\,n_2-1}\right\} \cup \left\{W \ge F_{\alpha/2,\,n_1-1,\,n_2-1}\right\}$; p-value $2\min\left\{P(F_{n_1-1,\,n_2-1} \le w),\ P(F_{n_1-1,\,n_2-1} \ge w)\right\}$.
- Left-tailed: $H_0: \sigma_1 = \sigma_2$ or $\sigma_1 \ge \sigma_2$ versus $H_1: \sigma_1 < \sigma_2$; rejection region $\left\{W \le F_{1-\alpha,\,n_1-1,\,n_2-1}\right\}$; p-value $P(F_{n_1-1,\,n_2-1} \le w)$.
- Right-tailed: $H_0: \sigma_1 = \sigma_2$ or $\sigma_1 \le \sigma_2$ versus $H_1: \sigma_1 > \sigma_2$; rejection region $\left\{W \ge F_{\alpha,\,n_1-1,\,n_2-1}\right\}$; p-value $P(F_{n_1-1,\,n_2-1} \ge w)$.
Example 5.20. A study involves the number of absences per year among union and
non-union workers. A sample of 16 union workers has a sample standard deviation
of 3.0 days. A sample of 10 non-union workers has a sample standard deviation
of 2.5 days. At the 10% significance level, can we conclude that the variance of
the number of days absent for union workers is different from that for nonunion
workers?
Solution. Let all union workers be referred to as population 1 and all non-union
workers be referred to as population 2.
We consider the hypotheses
$$H_0: \sigma_1 = \sigma_2 \quad \text{versus} \quad H_1: \sigma_1 \ne \sigma_2.$$
Since $n_i S_i^2 = (n_i - 1)s_i^2$, where $s_1 = 3.0$ and $s_2 = 2.5$ are the quoted (unbiased) sample standard deviations, the observed value of the test statistic is
$$w = \frac{n_1(n_2-1)s_1^2\,(n_1-1)}{n_2(n_1-1)s_2^2\,(n_2-1)}\cdot\frac{n_2}{n_1} = \frac{s_1^2}{s_2^2} = \frac{9}{6.25} = 1.44.$$
At the 10% significance level the rejection region is $\{W \le F_{0.95,\,15,\,9}\} \cup \{W \ge F_{0.05,\,15,\,9}\}$, and since $F_{0.95,\,15,\,9} = 1/F_{0.05,\,9,\,15} \approx 0.39 \le 1.44 \le 3.01 \approx F_{0.05,\,15,\,9}$, we do not reject $H_0$. At the 10% significance level, we cannot conclude that the variance of the number of days absent for union workers is different from that for non-union workers.
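The quoted critical values and the conclusion can be verified with the following sketch (not part of the notes).

```python
# Two-tailed F-test for equality of two normal variances (Example 5.20).
from scipy.stats import f

n1, s1 = 16, 3.0      # union workers (unbiased sample s.d.)
n2, s2 = 10, 2.5      # non-union workers
alpha = 0.10

w = s1**2 / s2**2                             # observed statistic, F_{15,9} under H0
lower = f.ppf(alpha / 2, n1 - 1, n2 - 1)      # F_{1-alpha/2, 15, 9}
upper = f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)  # F_{alpha/2, 15, 9}
p_value = 2 * min(f.cdf(w, n1 - 1, n2 - 1), f.sf(w, n1 - 1, n2 - 1))

print(f"w = {w:.3f}, cut-offs = ({lower:.3f}, {upper:.3f}), p-value = {p_value:.3f}")
# w = 1.44 lies between the cut-offs, so H0 is not rejected at the 10% level.
```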
5.5 Generalized likelihood ratio tests: Large samples

Unfortunately, the likelihood ratio method does not always produce a test statistic with a known probability distribution. Nevertheless, if the sample size is large, we can obtain an approximation to the distribution of the generalized likelihood ratio.

Suppose that the model has free parameters $\theta_1, \theta_2, \ldots, \theta_d$, that we test
$$H_0: \theta_i = \theta_{i,0} \text{ for all } i = 1, 2, \ldots, d
\quad \text{versus} \quad
H_1: \theta_i \ne \theta_{i,0} \text{ for at least one } i = 1, 2, \ldots, d,$$
and that $\lambda$ is the generalized likelihood ratio. Then, under very general conditions, when $H_0$ is true,
$$-2\ln\lambda \xrightarrow{d} \chi^2_d \quad \text{as } n \to \infty.$$
In other words, if a model with $m$ free parameters is tested with $H_0: \theta_i = \theta_{i,0}$ for all $i = 1, 2, \ldots, m$ against $H_1: \theta_i \ne \theta_{i,0}$ for at least one $i = 1, 2, \ldots, m$, then $d = m$.
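The following Monte Carlo sketch (not from the notes) illustrates this approximation for a single Bernoulli parameter, so that $-2\ln\lambda$ should be approximately $\chi^2_1$ under $H_0$; the sample size, null value, and number of replications are arbitrary choices.

```python
# Monte Carlo illustration of -2 ln(lambda) -> chi^2_1 for H0: p = 0.3
# in a Binomial(n, p) model (a sketch, not part of the notes).
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p0, reps = 200, 0.3, 50_000

y = rng.binomial(n, p0, size=reps)    # data simulated under H0
p_hat = y / n                         # unrestricted MLE of p

# -2 ln(lambda) = 2 [ y ln(p_hat/p0) + (n - y) ln((1 - p_hat)/(1 - p0)) ];
# xlogy handles the 0 * log(0) cases that occur when y = 0 or y = n.
stat = 2 * (xlogy(y, p_hat / p0) + xlogy(n - y, (1 - p_hat) / (1 - p0)))

print("empirical 95th percentile:", np.quantile(stat, 0.95))
print("chi^2_1 95th percentile:  ", chi2.ppf(0.95, df=1))
```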
Suppose that a random variable $X$ takes one of the values $a_1, a_2, \ldots, a_m$ with
$$P(X = a_i) = p_i, \quad i = 1, 2, \ldots, m, \qquad p_1 + p_2 + \cdots + p_m = 1.$$
We want to test
$$H_0: p_i = p_{i,0} \text{ for all } i = 1, 2, \ldots, m
\quad \text{versus} \quad
H_1: p_i \ne p_{i,0} \text{ for at least one } i = 1, 2, \ldots, m,$$
where $p_{i,0} > 0$ for $i = 1, 2, \ldots, m$ and $p_{1,0} + p_{2,0} + \cdots + p_{m,0} = 1$. Since $p_m = 1 - \sum_{i=1}^{m-1} p_i$, only $p_1, \ldots, p_{m-1}$ are free parameters, and the hypotheses can be rewritten as
$$H_0: p_i = p_{i,0} \text{ for all } i = 1, 2, \ldots, m-1
\quad \text{versus} \quad
H_1: p_i \ne p_{i,0} \text{ for at least one } i = 1, 2, \ldots, m-1,$$
where $p_{i,0} > 0$ for $i = 1, 2, \ldots, m-1$ and $\sum_{i=1}^{m-1} p_{i,0} < 1$.

Suppose that we have a random sample of $n$ observations on $X$, and let $Y_i$ denote the number of observations equal to $a_i$, $i = 1, 2, \ldots, m$.
Then,
$$\ln L(p_1, p_2, \ldots, p_{m-1})
= \ln C + \sum_{i=1}^{m-1} Y_i \ln p_i + \left(n - \sum_{i=1}^{m-1} Y_i\right)\ln\left(1 - \sum_{i=1}^{m-1} p_i\right),$$
$$\frac{\partial \ln L(p_1, p_2, \ldots, p_{m-1})}{\partial p_i}
= \frac{Y_i}{p_i} - \frac{n - \sum_{i=1}^{m-1} Y_i}{1 - \sum_{i=1}^{m-1} p_i},
\quad i = 1, 2, \ldots, m-1,$$
and the maximum likelihood estimators $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{m-1}$ satisfy
$$\frac{\partial \ln L(p_1, p_2, \ldots, p_{m-1})}{\partial p_i} = 0
\quad \text{at } p_i = \hat{p}_i, \ i = 1, 2, \ldots, m-1.$$
Then,
$$\frac{Y_1}{\hat{p}_1} = \frac{Y_2}{\hat{p}_2} = \cdots = \frac{Y_{m-1}}{\hat{p}_{m-1}}
= \frac{n - \sum_{i=1}^{m-1} Y_i}{1 - \sum_{i=1}^{m-1} \hat{p}_i}
= \frac{Y_1 + Y_2 + \cdots + Y_{m-1} + \left(n - \sum_{i=1}^{m-1} Y_i\right)}
       {\hat{p}_1 + \hat{p}_2 + \cdots + \hat{p}_{m-1} + \left(1 - \sum_{i=1}^{m-1} \hat{p}_i\right)} = n,$$
so that $\hat{p}_i = Y_i/n$ for $i = 1, 2, \ldots, m-1$; the maximum of $L(p_1, p_2, \ldots, p_{m-1})$ is thus $L(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{m-1})$.
Therefore the generalized likelihood ratio is
$$\lambda
= \frac{p_{1,0}^{\,Y_1}\, p_{2,0}^{\,Y_2} \cdots p_{m-1,0}^{\,Y_{m-1}} \left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)^{n - \sum_{i=1}^{m-1} Y_i}}
       {\hat{p}_1^{\,Y_1}\, \hat{p}_2^{\,Y_2} \cdots \hat{p}_{m-1}^{\,Y_{m-1}} \left(1 - \sum_{i=1}^{m-1} \hat{p}_i\right)^{n - \sum_{i=1}^{m-1} Y_i}}
= \prod_{i=1}^{m-1}\left(\frac{p_{i,0}}{\hat{p}_i}\right)^{Y_i}
  \left(\frac{1 - \sum_{i=1}^{m-1} p_{i,0}}{1 - \sum_{i=1}^{m-1} \hat{p}_i}\right)^{n - \sum_{i=1}^{m-1} Y_i}.$$
From
$$\frac{d}{dx}\left(x\ln\frac{x}{x_0}\right) = \ln\frac{x}{x_0} + x\cdot\frac{1}{x} = \ln\frac{x}{x_0} + 1
\qquad \text{and} \qquad
\frac{d^2}{dx^2}\left(x\ln\frac{x}{x_0}\right) = \frac{d}{dx}\left(\ln\frac{x}{x_0}\right) = \frac{1}{x},$$
we obtain the Taylor expansion about $x_0$,
$$x\ln\frac{x}{x_0} = (x - x_0) + \frac{(x - x_0)^2}{2x_0} + \cdots,$$
and therefore,
$$x\ln\frac{x}{x_0} \approx (x - x_0) + \frac{(x - x_0)^2}{2x_0} \quad \text{when } x \approx x_0.$$
Hence, when $n \to \infty$,
$$-2\ln\lambda
= 2n\sum_{i=1}^{m-1}\hat{p}_i\ln\frac{\hat{p}_i}{p_{i,0}}
+ 2n\left(1 - \sum_{i=1}^{m-1}\hat{p}_i\right)\ln\frac{1 - \sum_{i=1}^{m-1}\hat{p}_i}{1 - \sum_{i=1}^{m-1} p_{i,0}}$$
$$\approx 2n\sum_{i=1}^{m-1}\left[(\hat{p}_i - p_{i,0}) + \frac{(\hat{p}_i - p_{i,0})^2}{2p_{i,0}}\right]
+ 2n\left[\left(1 - \sum_{i=1}^{m-1}\hat{p}_i\right) - \left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)\right]
+ 2n\,\frac{\left[\left(1 - \sum_{i=1}^{m-1}\hat{p}_i\right) - \left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)\right]^2}
            {2\left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)}$$
$$= 2n\sum_{i=1}^{m-1}(\hat{p}_i - p_{i,0})
+ \sum_{i=1}^{m-1}\frac{(n\hat{p}_i - np_{i,0})^2}{np_{i,0}}
+ 2n\sum_{i=1}^{m-1}(p_{i,0} - \hat{p}_i)
+ \frac{\left[n\left(1 - \sum_{i=1}^{m-1}\hat{p}_i\right) - n\left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)\right]^2}
        {n\left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)}$$
$$= \sum_{i=1}^{m-1}\frac{(n\hat{p}_i - np_{i,0})^2}{np_{i,0}}
+ \frac{\left[n\left(1 - \sum_{i=1}^{m-1}\hat{p}_i\right) - n\left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)\right]^2}
        {n\left(1 - \sum_{i=1}^{m-1} p_{i,0}\right)}.$$
If we let
$$O_i = Y_i = n\hat{p}_i \ \text{ for } i = 1, 2, \ldots, m-1,
\qquad
O_m = n - \sum_{i=1}^{m-1} Y_i = n\left(1 - \sum_{i=1}^{m-1}\hat{p}_i\right),$$
$$E_i = np_{i,0} \ \text{ for } i = 1, 2, \ldots, m-1,
\qquad
E_m = n\left(1 - \sum_{i=1}^{m-1} p_{i,0}\right),$$
then,
$$-2\ln\lambda \approx \sum_{i=1}^{m}\frac{(O_i - E_i)^2}{E_i}
= \sum_{i=1}^{m}\frac{O_i^2}{E_i} - 2\sum_{i=1}^{m}O_i + \sum_{i=1}^{m}E_i
= \sum_{i=1}^{m}\frac{O_i^2}{E_i} - 2n + n
= \sum_{i=1}^{m}\frac{O_i^2}{E_i} - n.$$
Note that $\{-2\ln\lambda \ge K\}$ is the rejection region, and so
$$\left\{\sum_{i=1}^{m}\frac{(O_i - E_i)^2}{E_i} \ge \chi^2_{\alpha,\,m-1}\right\}$$
can serve as an approximate rejection region. Since this is only an approximate result, it is suggested that all expected frequencies be no less than 5, so that the sample is large enough. To meet this rule, some categories may be combined when it is logical to do so.
Example 5.21. A journal reported that, in a bag of m&ms chocolate peanut candies,
there are 30% brown, 30% yellow, 10% blue, 10% red, 10% green and 10% orange
candies. Suppose you purchase a bag of m&ms chocolate peanut candies at a nearby
store and find 17 brown, 20 yellow, 13 blue, 7 red, 6 green and 9 orange candies, for
a total of 72 candies. At the 0.1 level of significance, does the bag purchased agree
with the distribution suggested by the journal?
H0 : the bag purchased agrees with the distribution suggested by the journal,
versus
H1 : the bag purchased does not agree with the distribution suggested by the journal.
Then we have the table below, in which all expected frequencies are at least 5.
Colour   O_i   E_i               O_i - E_i
Brown    17    72 × 30% = 21.6   -4.6
Yellow   20    72 × 30% = 21.6   -1.6
Blue     13    72 × 10% = 7.2     5.8
Red       7    72 × 10% = 7.2    -0.2
Green     6    72 × 10% = 7.2    -1.2
Orange    9    72 × 10% = 7.2     1.8
Total    72    72                 0
We compute
$$-2\ln\lambda \approx \sum_{i=1}^{6}\frac{O_i^2}{E_i} - n
= \frac{17^2 + 20^2}{21.6} + \frac{13^2 + 7^2 + 6^2 + 9^2}{7.2} - 72
\approx 6.426 < 9.236 \approx \chi^2_{0.1,\,6-1}.$$
Alternatively,
$$-2\ln\lambda \approx \sum_{i=1}^{6}\frac{(O_i - E_i)^2}{E_i}
= \frac{(-4.6)^2}{21.6} + \frac{(-1.6)^2}{21.6} + \frac{5.8^2}{7.2} + \frac{(-0.2)^2}{7.2} + \frac{(-1.2)^2}{7.2} + \frac{1.8^2}{7.2}
\approx 6.426 < 9.236 \approx \chi^2_{0.1,\,6-1}.$$
Hence we should not reject H0 . At the significance level 10%, we cannot conclude
that the bag purchased does not agree with the distribution suggested by the journal.
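The same numbers can be obtained with scipy.stats.chisquare, as in the following sketch (not part of the notes).

```python
# Chi-square goodness-of-fit test for Example 5.21 (m&m colours).
import numpy as np
from scipy.stats import chisquare, chi2

observed = np.array([17, 20, 13, 7, 6, 9])
probs = np.array([0.30, 0.30, 0.10, 0.10, 0.10, 0.10])
expected = observed.sum() * probs

stat, p_value = chisquare(observed, expected)
print(f"statistic = {stat:.3f}, p-value = {p_value:.3f}")
print("critical value:", chi2.ppf(0.90, df=len(observed) - 1))   # about 9.236
# 6.426 < 9.236, so H0 is not rejected at the 10% level.
```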
Example 5.22. A traffic engineer wishes to study whether drivers have a preference
for certain tollbooths at a bridge during non-rush hours. The number of automobiles
passing through each tollbooth lane was counted during a randomly selected 15-
minute interval. The sample information is as follows.
Tollbooth Lane            1    2    3    4    5    Total
Number of cars observed   171  224  211  180  214  1000
Can we conclude that there are differences in the numbers of cars selecting each of the lanes? Test at the 5% significance level.
Solution. We test
$$H_0: p_1 = p_2 = \cdots = p_5 = \frac{1}{5}
\quad \text{versus} \quad
H_1: p_i \ne \frac{1}{5} \text{ for at least one lane } i,$$
where $p_i$ is the probability that a car selects lane $i$. All five expected frequencies equal $1000/5 = 200$, which is not less than 5.
Therefore, as the sample is large enough,
$$-2\ln\lambda \approx \sum_{i=1}^{5}\frac{O_i^2}{E_i} - n
= \frac{171^2 + 224^2 + 211^2 + 180^2 + 214^2}{200} - 1000
\approx 10.67 \ge 9.488 \approx \chi^2_{0.05,\,5-1}.$$
Hence, $H_0$ should be rejected. At the 5% significance level, we can conclude that there are differences in the numbers of cars selecting each of the lanes.
Sometimes the null hypothesis specifies the cell probabilities only up to $k$ unknown parameters (for example, when testing whether the data follow a Poisson distribution with unspecified mean). For calculating the $E_i$'s, we then have to use the maximum likelihood estimates of the unknown parameters. The rejection region is $\{-2\ln\lambda \ge K\}$ or, approximately,
$$\left\{\sum_{i=1}^{m}\frac{(O_i - E_i)^2}{E_i} \ge \chi^2_{\alpha,\,m-1-k}\right\}.$$
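As an illustration of the $m - 1 - k$ rule (not from the notes), the sketch below tests whether some hypothetical count data follow a Poisson distribution; the mean is estimated from the data ($k = 1$), and both the cell counts and the assumed estimate 1.8 are made up purely for demonstration.

```python
# Chi-square goodness-of-fit to a Poisson distribution with estimated mean.
import numpy as np
from scipy.stats import poisson, chi2

# Cells: {0}, {1}, {2}, {3}, {>= 4} occurrences per interval (made-up counts).
observed = np.array([18, 32, 25, 15, 10])
n = observed.sum()

# MLE of the Poisson mean is the sample mean of the raw counts; assume it is 1.8 here.
lam_hat, k = 1.8, 1

probs = poisson.pmf(np.arange(4), lam_hat)
probs = np.append(probs, 1 - probs.sum())            # P(X >= 4)
expected = n * probs

stat = ((observed - expected) ** 2 / expected).sum()
df = len(observed) - 1 - k                            # m - 1 - k degrees of freedom
print(f"statistic = {stat:.3f}, critical = {chi2.ppf(0.95, df):.3f}")
```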
Consider the following joint distribution of two discrete random variables $X$ and $Y$: $X$ takes values $a_1, \ldots, a_r$, $Y$ takes values $b_1, \ldots, b_c$, and
$$p_{i,j} = P(X = a_i, Y = b_j), \qquad i = 1, 2, \ldots, r, \ j = 1, 2, \ldots, c,$$
with row sums $p_{i\cdot} = \sum_{j=1}^{c} p_{i,j}$, column sums $p_{\cdot j} = \sum_{i=1}^{r} p_{i,j}$, and $\sum_{i=1}^{r}\sum_{j=1}^{c} p_{i,j} = 1$.
We want to test
$$H_0: X \text{ and } Y \text{ are independent}
\quad \text{versus} \quad
H_1: X \text{ and } Y \text{ are not independent.}$$
That is, we want to test
$$H_0: p_{i,j} = p_{i\cdot}\,p_{\cdot j} \text{ for all } (i, j)
\quad \text{versus} \quad
H_1: p_{i,j} \ne p_{i\cdot}\,p_{\cdot j} \text{ for at least one } (i, j).$$
Suppose we have $n$ observations on $(X, Y)$, and let $O_{i,j}$ denote the observed frequency of the pair $(a_i, b_j)$, with row totals $n_{i\cdot} = \sum_{j=1}^{c} O_{i,j}$, column totals $n_{\cdot j} = \sum_{i=1}^{r} O_{i,j}$, and overall total $n$.
Let $\lambda$ be the generalized likelihood ratio. Then it can be proved that
$$-2\ln\lambda \approx \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}
= \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{O_{i,j}^2}{E_{i,j}} - n
= n\left(\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{O_{i,j}^2}{n_{i\cdot}\,n_{\cdot j}} - 1\right),$$
where $E_{i,j} = \dfrac{n_{i\cdot}\,n_{\cdot j}}{n}$ is the expected frequency corresponding to $O_{i,j}$ when $H_0$ is true, $i = 1, 2, \ldots, r$ and $j = 1, 2, \ldots, c$. The rejection region is approximately
$$\left\{\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \ge \chi^2_{\alpha,\,(r-1)(c-1)}\right\},$$
where the degrees of freedom $(r-1)(c-1) = (rc - 1) - [(r-1) + (c-1)]$ is the number of free parameters in $\Omega$ minus the number of parameters that have to be estimated under $H_0$ (the $r-1$ row probabilities and the $c-1$ column probabilities).
Example 5.23. Suppose we draw a sample of 360 students and obtain the following
information. At the 0.01 level of significance, test whether a student's ability in mathematics is independent of the student's interest in statistics.
                                    Ability in Math
                             Low    Average   High   Sum
Interest in      Low          63         42     15   120
Statistics       Average      58         61     31   150
                 High         14         47     29    90
Sum                          135        150     75   360
Solution. Consider the hypotheses
$$H_0: \text{ability in mathematics and interest in statistics are independent}$$
versus
$$H_1: \text{ability in mathematics and interest in statistics are not independent (are related).}$$
The table below shows the expected frequencies (where, for example, $45 = 120 \times 135/360$ and $50 = 120 \times 150/360$).
                                    Ability in Math
                             Low      Average   High    Sum
Interest in      Low          45         50      25     120
Statistics       Average      56.25      62.5    31.25  150
                 High         33.75      37.5    18.75   90
Sum                          135        150      75     360
All expected frequencies are at least 5. Therefore, as the sample is large enough,
$$n\left(\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{O_{i,j}^2}{n_{i\cdot}\,n_{\cdot j}} - 1\right)
= 360\left(\frac{63^2}{120 \times 135} + \frac{42^2}{120 \times 150} + \cdots + \frac{29^2}{90 \times 75} - 1\right)
\approx 32.140 \ge 13.277 \approx \chi^2_{0.01,\,(3-1)(3-1)}.$$
Hence, at the significance level 1%, we reject $H_0$ and conclude that there is a relationship between a student's ability in mathematics and the student's interest in statistics.
Alternatively, the value of the test statistic equals
$$\sum_{i=1}^{3}\sum_{j=1}^{3}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}
= \frac{(63 - 45)^2}{45} + \frac{(42 - 50)^2}{50} + \cdots + \frac{(29 - 18.75)^2}{18.75}
\approx 32.140,$$
leading to the same conclusion.
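The whole computation in Example 5.23 can be reproduced with scipy.stats.chi2_contingency, as in this sketch (not part of the notes).

```python
# Chi-square test of independence for Example 5.23 (3 x 3 contingency table).
import numpy as np
from scipy.stats import chi2_contingency, chi2

observed = np.array([[63, 42, 15],
                     [58, 61, 31],
                     [14, 47, 29]])   # rows: interest Low/Average/High; columns: ability Low/Average/High

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"statistic = {stat:.3f}, df = {dof}, p-value = {p_value:.5f}")
print("critical value:", chi2.ppf(0.99, dof))   # about 13.277
# 32.140 >= 13.277, so H0 (independence) is rejected at the 1% level.
```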