Linear Model Methodology

Full-Rank Linear Models

6.3 Generalized Least-Squares Estimation


In this section, we discuss again the estimation of $\beta$ in model (6.5), but under a more general setup concerning the variance-covariance matrix of $\epsilon$. Here, we consider that $\mathrm{Var}(\epsilon) = \sigma^2 V$, where $V$ is a known positive definite matrix. Estimation of $\beta$ in this case can be easily reduced to the case discussed in Section 6.1. Multiplying both sides of (6.5) on the left by $V^{-1/2}$, we get

$$
Y_v = X_v \beta + \epsilon_v, \qquad (6.33)
$$

where $Y_v = V^{-1/2} Y$, $X_v = V^{-1/2} X$, and $\epsilon_v = V^{-1/2} \epsilon$. Note that $X_v$ is of full column rank and that $E(\epsilon_v) = 0$, $\mathrm{Var}(\epsilon_v) = V^{-1/2} (\sigma^2 V) V^{-1/2} = \sigma^2 I_n$. The OLS estimator of $\beta$ in model (6.33) is therefore given by

$$
\hat{\beta}_v = (X_v' X_v)^{-1} X_v' Y_v = (X' V^{-1} X)^{-1} X' V^{-1} Y. \qquad (6.34)
$$

We call $\hat{\beta}_v$ the generalized least-squares estimator (GLSE) of $\beta$ for model (6.5).


This estimator is unbiased for $\beta$ and its variance-covariance matrix is

$$
\mathrm{Var}(\hat{\beta}_v) = (X' V^{-1} X)^{-1} X' V^{-1} (\sigma^2 V) V^{-1} X (X' V^{-1} X)^{-1} = \sigma^2 (X' V^{-1} X)^{-1}. \qquad (6.35)
$$

Applying the Gauss-Markov Theorem (Theorem 6.1) to model (6.33), we conclude that $c'\hat{\beta}_v = c'(X' V^{-1} X)^{-1} X' V^{-1} Y$ is the BLUE of $c'\beta$, where $c$ is a given nonzero constant vector. In addition, using Corollary 6.1, if $C\beta$ is a vector linear function of $\beta$, where $C$ is a given matrix of order $q \times p$ and rank $q\ (\le p)$, then $C\hat{\beta}_v$ is the BLUE of $C\beta$. In particular, $\hat{\beta}_v$ is the BLUE of $\beta$.
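
As a minimal numerical sketch (not part of the text; the simulated data, the AR(1)-type choice of $V$, and all variable names are illustrative), the following Python code computes the GLSE directly from (6.34) and checks that it agrees with OLS applied to the transformed model (6.33):

```python
# Minimal sketch (not from the text): GLSE computed directly from (6.34) and via
# the whitened model (6.33).  Data, the AR(1)-type V, and all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([2.0, -1.0, 0.5])

# A known positive definite V (here an AR(1)-type correlation matrix) and sigma^2 V errors.
rho, sigma = 0.6, 1.5
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Y = X @ beta + rng.multivariate_normal(np.zeros(n), sigma**2 * V)

# Direct GLSE, formula (6.34): (X' V^{-1} X)^{-1} X' V^{-1} Y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)

# Same estimator via the transformation of (6.33): V^{-1/2} from the eigendecomposition of V
w, U = np.linalg.eigh(V)
V_inv_half = U @ np.diag(w**-0.5) @ U.T
Xv, Yv = V_inv_half @ X, V_inv_half @ Y
beta_gls_v = np.linalg.solve(Xv.T @ Xv, Xv.T @ Yv)

print(np.allclose(beta_gls, beta_gls_v))          # True
# Var(beta_gls) would be estimated by sigma^2 (X' V^{-1} X)^{-1}, as in (6.35).
```

Any square root of $V^{-1}$ (a Cholesky factor, for instance) yields the same estimator, since the transformed model always has error covariance $\sigma^2 I_n$.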

6.4 Least-Squares Estimation under Linear Restrictions on $\beta$

The parameter vector $\beta$ in model (6.5) may be subject to some linear restrictions of the form

$$
A\beta = m, \qquad (6.36)
$$

where

$A$ is a known matrix of order $r \times p$ and rank $r\ (\le p)$

$m$ is a known vector

For example, in a simple linear regression model, $Y = \beta_0 + \beta_1 x + \epsilon$, estimation of the slope $\beta_1$ may be needed when $\beta_0 = 0$, that is, for a model with a zero $Y$-intercept. Also, in a multiple linear regression model, such as $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$, the mean response, $\mu(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, may be set equal to zero when $x_1 = 1$, $x_2 = 3$. In this case, $\beta_0$, $\beta_1$, $\beta_2$ are subject to the linear restriction $\beta_0 + \beta_1 + 3\beta_2 = 0$.

In this section, we consider least-squares estimation of $\beta$ when $\beta$ is subject to linear restrictions of the form given in (6.36). We make the same assumptions on $\epsilon$ as in Section 6.1, namely, $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I_n$.

This particular type of estimation can be derived by minimizing $S(\beta)$ in (6.6) under the equality restrictions (6.36). A convenient way to do this is to use the method of Lagrange multipliers (see, for example, Khuri, 2003, Section 7.8). Consider the function

$$
T(\beta, \lambda) = S(\beta) + \lambda'(A\beta - m), \qquad (6.37)
$$

where $\lambda$ is a vector of $r$ Lagrange multipliers. Differentiating $T(\beta, \lambda)$ with respect to $\beta$ and equating the derivative to zero, we obtain

$$
-2X'Y + 2X'X\beta + A'\lambda = 0. \qquad (6.38)
$$

Solving (6.38) with respect to $\beta$ and denoting this solution by $\hat{\beta}_r$, we get

$$
\hat{\beta}_r = (X'X)^{-1}\left(X'Y - \tfrac{1}{2}A'\lambda\right). \qquad (6.39)
$$

Substituting $\hat{\beta}_r$ for $\beta$ in (6.36), we obtain

$$
A(X'X)^{-1}\left(X'Y - \tfrac{1}{2}A'\lambda\right) = m.
$$

Solving this equation for $\lambda$, we get

$$
\lambda = 2\left[A(X'X)^{-1}A'\right]^{-1}\left[A(X'X)^{-1}X'Y - m\right]. \qquad (6.40)
$$

From (6.39) and (6.40) we then have the solution

$$
\begin{aligned}
\hat{\beta}_r &= (X'X)^{-1}X'Y - (X'X)^{-1}A'\left[A(X'X)^{-1}A'\right]^{-1}\left[A(X'X)^{-1}X'Y - m\right] \\
&= \hat{\beta} - (X'X)^{-1}A'\left[A(X'X)^{-1}A'\right]^{-1}\left(A\hat{\beta} - m\right), \qquad (6.41)
\end{aligned}
$$

where $\hat{\beta}$ is the OLS estimator given in (6.9). This solution is called the restricted least-squares estimator of $\beta$. It is easy to see that $\hat{\beta}_r$ satisfies the equality restrictions in (6.36). Furthermore, $S(\beta)$ attains its minimum value, $S(\hat{\beta}_r)$, over the parameter space constrained by (6.36) when $\beta = \hat{\beta}_r$. It can also be verified that

$$
S(\hat{\beta}_r) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta}_r - \hat{\beta})'X'X(\hat{\beta}_r - \hat{\beta}) = \mathrm{SSE} + (\hat{\beta}_r - \hat{\beta})'X'X(\hat{\beta}_r - \hat{\beta}). \qquad (6.42)
$$

Hence, $S(\hat{\beta}_r) \ge \mathrm{SSE}$, with equality attained if and only if $\hat{\beta} = \hat{\beta}_r$, that is, only if the unrestricted OLS estimator $\hat{\beta}$ itself satisfies the restrictions in (6.36). Using (6.41), formula (6.42) can also be written as

$$
S(\hat{\beta}_r) = \mathrm{SSE} + (A\hat{\beta} - m)'\left[A(X'X)^{-1}A'\right]^{-1}(A\hat{\beta} - m).
$$

The assertion that $S(\beta)$ attains its minimum value at $\hat{\beta}_r$ in the constrained parameter space can be verified as follows: Let $\beta$ be any vector in the constrained parameter space. Then, $A\beta = m = A\hat{\beta}_r$. Hence, $A(\beta - \hat{\beta}_r) = 0$. Therefore,

$$
\begin{aligned}
S(\beta) &= (Y - X\beta)'(Y - X\beta) \\
&= (Y - X\hat{\beta}_r + X\hat{\beta}_r - X\beta)'(Y - X\hat{\beta}_r + X\hat{\beta}_r - X\beta) \\
&= (Y - X\hat{\beta}_r)'(Y - X\hat{\beta}_r) + 2(Y - X\hat{\beta}_r)'(X\hat{\beta}_r - X\beta) + (X\hat{\beta}_r - X\beta)'(X\hat{\beta}_r - X\beta).
\end{aligned}
$$

But, from (6.41) we have

$$
\begin{aligned}
(Y - X\hat{\beta}_r)'(X\hat{\beta}_r - X\beta) &= \left\{Y - X\hat{\beta} + X(X'X)^{-1}A'\left[A(X'X)^{-1}A'\right]^{-1}(A\hat{\beta} - m)\right\}'(X\hat{\beta}_r - X\beta) \\
&= (A\hat{\beta} - m)'\left[A(X'X)^{-1}A'\right]^{-1}A(X'X)^{-1}X'X(\hat{\beta}_r - \beta), \quad \text{since } (Y - X\hat{\beta})'X = 0, \\
&= 0, \quad \text{since } A(\hat{\beta}_r - \beta) = 0.
\end{aligned}
$$

It follows that

$$
S(\beta) = S(\hat{\beta}_r) + (\hat{\beta}_r - \beta)'X'X(\hat{\beta}_r - \beta).
$$

Hence, $S(\beta) \ge S(\hat{\beta}_r)$, and equality is attained if and only if $\beta = \hat{\beta}_r$.
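
The following sketch (not part of the text; the simulated data and names are illustrative) computes the restricted least-squares estimator from (6.41), using the restriction $\beta_0 + \beta_1 + 3\beta_2 = 0$ of the earlier example, and verifies that it satisfies (6.36) and that $S(\hat{\beta}_r)$ exceeds $\mathrm{SSE}$ by the quantity given after (6.42):

```python
# Minimal sketch (not from the text): restricted least squares via (6.41), with the
# illustrative restriction beta0 + beta1 + 3*beta2 = 0, i.e., A = [1 1 3], m = 0.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
Y = 1.0 + 0.5 * x1 - 0.5 * x2 + rng.normal(size=n)

A = np.array([[1.0, 1.0, 3.0]])                   # r x p with rank r
m = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y                      # unrestricted OLS estimator of (6.9)

# Restricted estimator, formula (6.41)
middle = np.linalg.inv(A @ XtX_inv @ A.T)         # [A (X'X)^{-1} A']^{-1}
beta_r = beta_hat - XtX_inv @ A.T @ middle @ (A @ beta_hat - m)

print(np.allclose(A @ beta_r, m))                 # the restriction (6.36) holds

SSE = np.sum((Y - X @ beta_hat) ** 2)
S_r = np.sum((Y - X @ beta_r) ** 2)
# S(beta_r) = SSE + (A beta_hat - m)'[A (X'X)^{-1} A']^{-1}(A beta_hat - m)
gap = (A @ beta_hat - m) @ middle @ (A @ beta_hat - m)
print(np.isclose(S_r, SSE + gap), S_r >= SSE)     # True True
```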



6.5 Maximum Likelihood Estimation


This type of estimation requires that particular assumptions be made concerning the distribution of the error term $\epsilon$ in model (6.5). In this section, we assume that $\epsilon$ is normally distributed as $N(0, \sigma^2 I_n)$. Hence, $Y \sim N(X\beta, \sigma^2 I_n)$. The corresponding likelihood function, which, for a given value, $y$, of $Y$, is the same as the density function of $Y$ but is treated as a function of $\beta$ and $\sigma^2$, is therefore given by

$$
L(\beta, \sigma^2, y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right]. \qquad (6.43)
$$

By definition, the maximum likelihood estimates (MLE) of $\beta$ and $\sigma^2$ are those values of $\beta$ and $\sigma^2$ that maximize the likelihood function, or equivalently, the natural logarithm of $L$, namely

$$
l(\beta, \sigma^2, y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).
$$

To find the MLE of $\beta$ and $\sigma^2$, we proceed as follows: We first find the stationary values of $\beta$ and $\sigma^2$ for which the partial derivatives of $l(\beta, \sigma^2, y)$ with respect to $\beta$ and $\sigma^2$ are equal to zero. The next step is to verify that these values maximize $l(\beta, \sigma^2, y)$. Setting the partial derivatives of $l(\beta, \sigma^2, y)$ with respect to $\beta$ and $\sigma^2$ to zero, we get

$$
\frac{\partial l(\beta, \sigma^2, y)}{\partial \beta} = -\frac{1}{2\sigma^2}\left(-2X'y + 2X'X\beta\right) = 0 \qquad (6.44)
$$

$$
\frac{\partial l(\beta, \sigma^2, y)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0. \qquad (6.45)
$$

Let $\tilde{\beta}$ and $\tilde{\sigma}^2$ denote the solutions of equations (6.44) and (6.45) for $\beta$ and $\sigma^2$, respectively. From (6.44) we find that

$$
\tilde{\beta} = (X'X)^{-1}X'Y, \qquad (6.46)
$$

which is the same as $\hat{\beta}$, the OLS estimator of $\beta$ in (6.9). Note that $Y$ was used in place of $y$ in (6.46) since the latter originated from the likelihood function in (6.43), where it was treated as a mathematical variable. In formula (6.46), however, $Y$ is treated as a random vector since the estimator is data dependent. From (6.45) we get

$$
\tilde{\sigma}^2 = \frac{1}{n}(Y - X\tilde{\beta})'(Y - X\tilde{\beta}). \qquad (6.47)
$$

We recall that $\hat{\sigma}^2 = \mathrm{MSE}$ is an unbiased estimator of $\sigma^2$ [see property 6.2.1(c)]. The estimator $\tilde{\sigma}^2$, however, is not unbiased for $\sigma^2$.

Let us now verify that $\tilde{\beta}$ and $\tilde{\sigma}^2$ are indeed maximal values for $l(\beta, \sigma^2, y)$ with respect to $\beta$ and $\sigma^2$. For this purpose, we consider the matrix of second-order partial derivatives of $l(\beta, \sigma^2, y)$ with respect to $\beta$ and $\sigma^2$. This is the Hessian matrix and is given by

$$
M = \begin{bmatrix}
\dfrac{\partial^2 l(\beta, \sigma^2, y)}{\partial\beta\,\partial\beta'} & \dfrac{\partial^2 l(\beta, \sigma^2, y)}{\partial\beta\,\partial\sigma^2} \\[2ex]
\dfrac{\partial^2 l(\beta, \sigma^2, y)}{\partial\sigma^2\,\partial\beta'} & \dfrac{\partial^2 l(\beta, \sigma^2, y)}{\partial(\sigma^2)^2}
\end{bmatrix}.
$$

Evaluating the matrix $M$ at $\beta = \tilde{\beta}$ and $\sigma^2 = \tilde{\sigma}^2$, we get

$$
M = \begin{bmatrix}
-\dfrac{1}{\tilde{\sigma}^2}X'X & \dfrac{1}{2\tilde{\sigma}^4}\left(-2X'y + 2X'X\tilde{\beta}\right) \\[2ex]
\dfrac{1}{2\tilde{\sigma}^4}\left(-2y'X + 2\tilde{\beta}'X'X\right) & \dfrac{n}{2\tilde{\sigma}^4} - \dfrac{1}{\tilde{\sigma}^6}(y - X\tilde{\beta})'(y - X\tilde{\beta})
\end{bmatrix}. \qquad (6.48)
$$

Making use of (6.44) and (6.47) in (6.48), we obtain

$$
M = \begin{bmatrix}
-\dfrac{1}{\tilde{\sigma}^2}X'X & 0 \\[1ex]
0 & -\dfrac{n}{2\tilde{\sigma}^4}
\end{bmatrix}, \qquad (6.49)
$$

which is clearly negative definite. Hence, $l(\beta, \sigma^2, y)$ has a local maximum at $\beta = \tilde{\beta}$, $\sigma^2 = \tilde{\sigma}^2$ (see, for example, Khuri, 2003, Corollary 7.7.1). Since this is the only local maximum, it must also be the absolute maximum. Hence, $\tilde{\beta}$ and $\tilde{\sigma}^2$ are the MLE of $\beta$ and $\sigma^2$, respectively. The maximum value of $L(\beta, \sigma^2, y)$ in (6.43) is

$$
\max_{\beta,\, \sigma^2} L(\beta, \sigma^2, y) = \frac{1}{(2\pi\tilde{\sigma}^2)^{n/2}}\, e^{-n/2}. \qquad (6.50)
$$
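
As a quick numerical check (not part of the text; the simulated data and names are illustrative), the sketch below confirms that $\tilde{\beta}$ coincides with the OLS estimator, that $\tilde{\sigma}^2 = \frac{n-p}{n}\,\mathrm{MSE}$, and that the maximized log-likelihood matches the value implied by (6.50):

```python
# Minimal sketch (not from the text): the MLEs (6.46)-(6.47) versus the OLS quantities,
# and the maximized likelihood (6.50).  Simulated data; all names are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=n)

beta_tilde = np.linalg.solve(X.T @ X, X.T @ Y)    # (6.46), identical to the OLS estimator
resid = Y - X @ beta_tilde
sigma2_tilde = resid @ resid / n                  # (6.47), the MLE of sigma^2 (biased)
MSE = resid @ resid / (n - p)                     # unbiased estimator of sigma^2

print(np.isclose(sigma2_tilde, (n - p) / n * MSE))        # True

def log_lik(beta, sigma2):
    r = Y - X @ beta
    return -n / 2 * np.log(2 * np.pi * sigma2) - r @ r / (2 * sigma2)

# Log of the maximum value in (6.50): -(n/2) log(2*pi*sigma2_tilde) - n/2
log_max_L = -n / 2 * np.log(2 * np.pi * sigma2_tilde) - n / 2
print(np.isclose(log_lik(beta_tilde, sigma2_tilde), log_max_L))   # True
```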

6.5.1 Properties of Maximum Likelihood Estimators


We have that $\tilde{\beta} = \hat{\beta}$ and $\tilde{\sigma}^2 = \dfrac{\mathrm{SSE}}{n} = \dfrac{n-p}{n}\,\mathrm{MSE}$. On the basis of the properties given in Section 6.2.1.1, $\tilde{\beta} \sim N\!\left(\beta,\ \sigma^2 (X'X)^{-1}\right)$, $\dfrac{n\tilde{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$, and $\tilde{\beta}$ and $\tilde{\sigma}^2$ are independent. In addition, $\tilde{\beta}$ and $\tilde{\sigma}^2$ have two more properties, namely, sufficiency and completeness. In order to understand these properties, the following definitions are needed:

Definition 6.1 Let $Y$ be a random vector whose distribution depends on a parameter vector, $\theta \in \Omega$, where $\Omega$ is some parameter space. A statistic $U(Y)$ is a sufficient statistic for $\theta$ if the conditional distribution of $Y$ given the value of $U(Y)$ does not depend on $\theta$.

In practice, the determination of sufficiency is more easily accomplished by using the following well-known theorem in statistical inference, namely, the Factorization Theorem (see, for example, Casella and Berger, 2002, Theorem 6.2.6).

Theorem 6.2 (Factorization Theorem) Let $g(y, \theta)$ denote the density function of the random vector $Y$. A statistic $U(Y)$ is a sufficient statistic for $\theta \in \Omega$ if and only if $g(y, \theta)$ can be written as

$$
g(y, \theta) = g_1(y)\, g_2(u(y), \theta)
$$

for all $\theta \in \Omega$, where $g_1$ is a function of $y$ only and $g_2$ is a function of $u(y)$ and $\theta$.
Definition 6.2 Let $Y$ be a random vector with the density function $g(y, \theta)$, which depends on a parameter vector, $\theta \in \Omega$. Let $\mathcal{F}$ denote the family of distributions $\{g(y, \theta),\ \theta \in \Omega\}$. This family is said to be complete if, for every function $h(Y)$ for which

$$
E_\theta[h(Y)] = 0, \quad \text{for all } \theta \in \Omega,
$$

we have $h(Y) = 0$ with probability equal to 1 for all $\theta \in \Omega$.

Note that completeness is a property of a family of distributions, not of a particular distribution. For example, let $Y \sim N(\mu, 1)$; then

$$
g(y, \mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(y - \mu)^2\right], \quad -\infty < y < \infty, \qquad (6.51)
$$

where $-\infty < \mu < \infty$. Let $h(Y)$ be a function such that $E[h(Y)] = 0$ for all $\mu$. Then,

$$
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} h(y) \exp\left[-\frac{1}{2}(y - \mu)^2\right] dy = 0, \quad -\infty < \mu < \infty,
$$

which implies that

$$
\int_{-\infty}^{\infty} h(y)\, e^{-y^2/2}\, e^{\mu y}\, dy = 0, \quad -\infty < \mu < \infty. \qquad (6.52)
$$

The left-hand side of (6.52) is the two-sided Laplace transformation of the function $h(y)e^{-y^2/2}$ [see Chapter III in Zemanian (1987)]. Since the two-sided Laplace transformation of the zero function is also equal to zero, we conclude that

$$
h(y)e^{-y^2/2} = 0, \qquad (6.53)
$$

and hence $h(y) = 0$ almost everywhere (with respect to Lebesgue measure). This assertion is based on the uniqueness property of the two-sided Laplace transformation [see Theorem 3.5.2 in Zemanian (1987, p. 69)]. We thus have $P[h(Y) = 0] = 1$, which indicates that the family of normal distributions, $N(\mu, 1)$, is complete.

We can clearly see that we would not have been able to conclude from (6.52) that $h(y) = 0$ if $\mu$ had just a fixed value. This explains our earlier remark that completeness is a property of a family of distributions, but not of a particular distribution. For example, if $h(Y) = Y$ and $Y \sim N(0, 1)$, then having $E[h(Y)] = E(Y) = 0$ does not imply that $h(Y) = 0$.

Definition 6.3 Let $U(Y)$ be a statistic whose distribution belongs to a family of distributions that is complete. Then, $U(Y)$ is said to be a complete statistic.

For example, if $\bar{Y}$ is the sample mean of a sample of size $n$ from a normal distribution $N(\mu, 1)$, $-\infty < \mu < \infty$, then $\bar{Y} \sim N\!\left(\mu, \frac{1}{n}\right)$. Since this family of distributions is complete, as was seen earlier, we conclude that $\bar{Y}$ is a complete statistic.

The completeness of the family of normal distributions $N(\mu, 1)$ can be derived as a special case using a more general family of distributions called the exponential family.

Definition 6.4 Let $\mathcal{F} = \{g(y, \theta),\ \theta \in \Omega\}$ be a family of density functions (or probability mass functions) such that

$$
g(y, \theta) = \phi(y)\, c(\theta)\, \exp\left[\sum_{i=1}^{k} \omega_i(\theta)\, t_i(y)\right], \qquad (6.54)
$$

where $\phi(y) \ge 0$ and $t_1(y), t_2(y), \ldots, t_k(y)$ are real-valued functions of $y$ only, and $c(\theta) \ge 0$ and $\omega_1(\theta), \omega_2(\theta), \ldots, \omega_k(\theta)$ are real-valued functions of $\theta$ only. Then, $\mathcal{F}$ is called an exponential family.

Several well-known distributions belong to the exponential family. These include the normal, gamma, and beta distributions, among the continuous distributions; and the binomial, Poisson, and negative binomial, among the discrete distributions. For example, for the family of normal distributions, $N(\mu, \sigma^2)$, we have

$$
\begin{aligned}
g(y, \theta) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right], \quad -\infty < \mu < \infty,\ \sigma > 0, \qquad (6.55) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\mu^2}{2\sigma^2}\right) \exp\left(-\frac{y^2}{2\sigma^2} + \frac{\mu y}{\sigma^2}\right).
\end{aligned}
$$

Comparing this with the expression in (6.54), we see that $\theta = (\mu, \sigma^2)'$, $\phi(y) = 1$, $c(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\mu^2}{2\sigma^2}\right)$, $\omega_1(\theta) = -\frac{1}{2\sigma^2}$, $t_1(y) = y^2$, $\omega_2(\theta) = \frac{\mu}{\sigma^2}$, and $t_2(y) = y$.
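
A short symbolic check (not part of the text, written here with sympy) confirms that the $N(\mu, \sigma^2)$ density factors into the exponential-family form (6.54) with the components identified above:

```python
# Symbolic check (not from the text) that the N(mu, sigma^2) density factors into the
# exponential-family form (6.54) with phi, c, omega_1, t_1, omega_2, t_2 as listed above.
import sympy as sp

y, mu = sp.symbols('y mu', real=True)
s2 = sp.symbols('sigma2', positive=True)          # stands for sigma^2

g = 1 / sp.sqrt(2 * sp.pi * s2) * sp.exp(-(y - mu) ** 2 / (2 * s2))

phi = 1
c = 1 / sp.sqrt(2 * sp.pi * s2) * sp.exp(-mu ** 2 / (2 * s2))
w1, t1 = -1 / (2 * s2), y ** 2                    # omega_1(theta), t_1(y)
w2, t2 = mu / s2, y                               # omega_2(theta), t_2(y)

factored = phi * c * sp.exp(w1 * t1 + w2 * t2)
print(g.equals(factored))                         # True
```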

The exponential family has several nice properties. One of these properties is given by the following well-known theorem (see, for example, Arnold, 1981, Theorem 1.2, p. 2; Graybill, 1976, Theorem 2.7.8, p. 79; Wasan, 1970, Theorem 2, p. 64):

Theorem 6.3 Consider the exponential family defined in formula (6.54). Let $\omega(\theta) = [\omega_1(\theta), \omega_2(\theta), \ldots, \omega_k(\theta)]'$ and $t(y) = [t_1(y), t_2(y), \ldots, t_k(y)]'$. Then, $t(Y)$ is a complete and sufficient statistic provided that the set $\{\omega(\theta),\ \theta \in \Omega\}$ contains a nondegenerate $k$-dimensional rectangle (i.e., has a nonempty interior).

After this series of definitions and theorems, we are now ready to show that the maximum likelihood estimators, $\tilde{\beta}$ and $\tilde{\sigma}^2$, have the properties of sufficiency and completeness.

Theorem 6.4 Let $Y \sim N(X\beta, \sigma^2 I_n)$. Then, the maximum likelihood estimators of $\beta$ and $\sigma^2$, namely, $\tilde{\beta} = (X'X)^{-1}X'Y$ and

$$
\tilde{\sigma}^2 = \frac{1}{n}\, Y'\left[I_n - X(X'X)^{-1}X'\right]Y, \qquad (6.56)
$$

are complete and sufficient statistics for $\beta$ and $\sigma^2$.
Proof. The density function, $g(y, \theta)$, of $Y$ is the same as the likelihood function in (6.43), where $\theta = (\beta', \sigma^2)'$. We can then write

$$
\begin{aligned}
g(y, \theta) &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right] \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[(y - X\tilde{\beta})'(y - X\tilde{\beta}) + (\tilde{\beta} - \beta)'X'X(\tilde{\beta} - \beta)\right]\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[n\tilde{\sigma}^2 + (\tilde{\beta} - \beta)'X'X(\tilde{\beta} - \beta)\right]\right\}. \qquad (6.57)
\end{aligned}
$$

We note that the right-hand side of (6.57) depends on $y$ only through $\tilde{\sigma}^2$ and the elements of $\tilde{\beta}$. Hence, by the Factorization Theorem (Theorem 6.2), the statistic $(\tilde{\beta}', \tilde{\sigma}^2)'$ is sufficient for $\theta$ [the function $g_1$ in Theorem 6.2, in this case, is identically equal to one, and the function $g_2$ is equal to the right-hand side of (6.57)].

Now, to show completeness, let us rewrite (6.57) as

$$
g(y, \theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\, \beta'X'X\beta\right) \exp\left\{-\frac{1}{2\sigma^2}\left[n\tilde{\sigma}^2 + \tilde{\beta}'X'X\tilde{\beta}\right] + \frac{1}{\sigma^2}\, \beta'X'X\tilde{\beta}\right\}. \qquad (6.58)
$$
2

By comparing (6.58) with (6.54), we find that $g(y, \theta)$ belongs to the exponential family with $k = p + 1$,

$$
\begin{aligned}
\phi(y) &= 1, \\
c(\theta) &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\, \beta'X'X\beta\right), \\
\omega_1(\theta) &= -\frac{1}{2\sigma^2}, \\
t_1(y) &= n\tilde{\sigma}^2 + \tilde{\beta}'X'X\tilde{\beta}, \qquad (6.59) \\
\omega_2(\theta) &= \frac{1}{\sigma^2}\, \beta, \\
t_2(y) &= X'X\tilde{\beta}. \qquad (6.60)
\end{aligned}
$$

Furthermore, the set

$$
\left\{\omega(\theta) = \left[\omega_1(\theta), \omega_2'(\theta)\right]' = \left[-\frac{1}{2\sigma^2},\ \frac{1}{\sigma^2}\beta'\right]',\ \theta \in \Omega\right\}
$$

is a subset of a $(p + 1)$-dimensional Euclidean space with a negative first coordinate, and this subset has a nonempty interior. Hence, by Theorem 6.3, $t(Y) = [t_1(Y), t_2'(Y)]'$ is a complete statistic. But, from (6.59) and (6.60) we can solve for $\tilde{\beta}$ and $\tilde{\sigma}^2$ in terms of $t_1(Y)$ and $t_2(Y)$, and we obtain

$$
\begin{aligned}
\tilde{\beta} &= (X'X)^{-1} t_2(Y), \\
\tilde{\sigma}^2 &= \frac{1}{n}\left[t_1(Y) - t_2'(Y)(X'X)^{-1}X'X(X'X)^{-1} t_2(Y)\right] \\
&= \frac{1}{n}\left[t_1(Y) - t_2'(Y)(X'X)^{-1} t_2(Y)\right].
\end{aligned}
$$

It follows that $(\tilde{\beta}', \tilde{\sigma}^2)'$ is a complete statistic (any invertible function of a statistic with a complete family has a complete family; see Arnold, 1981, Lemma 1.3, p. 3). We finally conclude that $(\tilde{\beta}', \tilde{\sigma}^2)'$ is a complete and sufficient statistic for $(\beta', \sigma^2)'$.
Corollary 6.2 Let $Y \sim N(X\beta, \sigma^2 I_n)$, where $X$ is of order $n \times p$ and rank $p\ (< n)$. Then, $\hat{\beta} = (X'X)^{-1}X'Y$ and

$$
\mathrm{MSE} = \frac{1}{n - p}\, Y'\left[I_n - X(X'X)^{-1}X'\right]Y
$$

are complete and sufficient statistics for $\beta$ and $\sigma^2$.
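
As a final numerical illustration (not part of the text; the simulated data and names are illustrative), the sketch below checks the algebraic identity underlying (6.57), namely $(y - X\beta)'(y - X\beta) = n\tilde{\sigma}^2 + (\tilde{\beta} - \beta)'X'X(\tilde{\beta} - \beta)$, which is what makes the likelihood depend on the data only through $(\tilde{\beta}', \tilde{\sigma}^2)'$:

```python
# Minimal numerical sketch (not from the text) of the identity behind (6.57):
# (y - X b)'(y - X b) = n*sigma2_tilde + (beta_tilde - b)' X'X (beta_tilde - b)
# for any parameter value b, because X'(y - X beta_tilde) = 0.  Names are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.3, size=n)

beta_tilde = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_tilde = np.sum((y - X @ beta_tilde) ** 2) / n

b = np.array([0.3, -1.0, 2.0])                    # an arbitrary parameter value
lhs = np.sum((y - X @ b) ** 2)
rhs = n * sigma2_tilde + (beta_tilde - b) @ (X.T @ X) @ (beta_tilde - b)
print(np.isclose(lhs, rhs))                       # True
```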


The completeness and sufficiency of $\hat{\beta}$ and $\mathrm{MSE}$ in Corollary 6.2, combined with their being unbiased for $\beta$ and $\sigma^2$, give these estimators a certain optimality property, which is described in the next theorem.
