
Review of Linear Models II

Presidency University

February, 2025
Estimation

- A parametric function (or linear parametric function) is a linear combination of the parameters,

      c^T β = Σ_{i=1}^{p} c_i β_i.

  A parametric function c^T β is called estimable if there exists an a ∈ R^n such that E(a^T y) = c^T β, that is, if there exists a linear unbiased estimator (LUE) of c^T β.

- A parametric function c^T β is estimable iff c^T ∈ R(X) (equivalently, c ∈ C(X^T), or c ∈ C(X^T X)).

Two Corollaries

- Any estimable parametric function is of the form a^T X β for some a ∈ R^n.

- If Rank(X_{n×p}) = p, then all parametric functions are estimable.

- Indeed, writing [X; c^T] for X with the row c^T appended, we have

      p = Rank(X_{n×p}) ≤ Rank([X; c^T]_{(n+1)×p}) ≤ p

  ⇒ Rank(X) = Rank([X; c^T]_{(n+1)×p}) whatever c^T may be.

- This implies that any parametric function c^T β is estimable.



An important result

- Let c^T β be estimable and let a^T y be any LUE of c^T β. Then there exists an a* ∈ C(X) such that
  - a*^T y is also a LUE of c^T β,
  - Var(a*^T y) is minimum among all LUEs,
  - the choice of a* is unique.

- In other words, for an estimable parametric function c^T β, among the class of all linear unbiased estimators there exists one with minimum variance, and the choice of this minimum-variance linear unbiased estimator is unique.

- Summarizing: for any estimable parametric function c^T β there exists a unique minimum-variance linear unbiased estimator. It is called the Best Linear Unbiased Estimator (BLUE) of c^T β.

Estimation space and Error space

- The column space C(X) of the design matrix X is called the estimation space. Further, a linear function w^T y is called an error function or linear zero function (LZF) if it is an unbiased estimator of 0, that is, if

      E(w^T y) = 0 for all β.

  The vector space spanned by the error functions is called the error space.

- It can be shown that the error space is C(X)^⊥, the orthocomplement of C(X) in the n-dimensional vector space R^n.

- Thus, for a linear model with design matrix X_{n×p}, if Rank(X) = r (≤ p), then there exist n − r linearly independent choices of w ∈ C(X)^⊥, and hence n − r linearly independent error functions.

Algorithm: find the BLUE from a LUE

- We are interested in an estimable c^T β.

- First find the rank of X_{n×p}. If X is of rank r, then we have n − r linearly independent error functions.

- Then find any LUE a^T y of c^T β.

- Then find n − r linearly independent linear zero functions, say e_1, e_2, ..., e_{n−r}, where e_i = w_i^T y.

- Then write the BLUE as

      a*^T y = a^T y + Σ_i ℓ_i w_i^T y.

- Find each ℓ_i by imposing the orthogonality restriction

      cov(a*^T y, w_i^T y) = 0.

- If, further, the error functions are mutually orthogonal, then for all i

      ℓ_i = − cov(a^T y, w_i^T y) / Var(w_i^T y).

Example

- Consider the linear model

      μ1 + μ2 + μ3 + ε1 = 5.1
      μ1 + μ2 + μ3 + ε2 = 8.2
      μ1 − μ2 + ε3 = 4.9
      −μ1 + μ2 + ε4 = 3.1

- Thus the design matrix here is

      X = [  1   1   1
             1   1   1
             1  −1   0
            −1   1   0 ].

Example

- μ2 is not estimable: for c^T = (0, 1, 0),

      Rank([X; c^T]) ≠ Rank(X) = 2,

  but μ1 + μ2 + μ3 is estimable.

- A LUE of μ1 + μ2 + μ3 is y1 = a^T y with a^T = (1, 0, 0, 0). Next we try to find the BLUE of μ1 + μ2 + μ3.

- We need to find the linearly independent error functions. Since Rank(X) = 2, there will be 4 − 2 = 2 linearly independent error functions.

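These rank conditions are easy to check numerically. Below is a minimal numpy sketch (an illustration, not part of the original slides) applying the criterion Rank([X; c^T]) = Rank(X) to the design matrix above:

    import numpy as np

    X = np.array([[1., 1., 1.],
                  [1., 1., 1.],
                  [1., -1., 0.],
                  [-1., 1., 0.]])

    def estimable(c, X):
        # c^T beta is estimable iff appending the row c^T does not increase the rank of X
        return np.linalg.matrix_rank(np.vstack([X, c])) == np.linalg.matrix_rank(X)

    print(np.linalg.matrix_rank(X))                 # 2
    print(estimable(np.array([0., 1., 0.]), X))     # False: mu2 is not estimable
    print(estimable(np.array([1., 1., 1.]), X))     # True:  mu1 + mu2 + mu3 is estimable
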
Example

- We have E(y1) = μ1 + μ2 + μ3 and E(y2) = μ1 + μ2 + μ3, whereas E(y3) = μ1 − μ2 and E(y4) = −μ1 + μ2.

- This implies E(y1 − y2) = 0 and E(y3 + y4) = 0. Thus two linearly independent error functions are y1 − y2 = w1^T y and y3 + y4 = w2^T y.

- The BLUE can be written as

      a*^T y = a^T y + ℓ1 w1^T y + ℓ2 w2^T y.

- Since here the error functions are mutually orthogonal, we get ℓ_i = − cov(a^T y, w_i^T y) / Var(w_i^T y), and hence

      ℓ1 = − cov(y1, y1 − y2) / Var(y1 − y2) = −1/2   and   ℓ2 = − cov(y1, y3 + y4) / Var(y3 + y4) = 0.

- Thus the BLUE is y1 − (1/2)(y1 − y2) = (1/2)(y1 + y2) = 6.65.

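As an aside, a small numpy check on the same data (not part of the slides) carries out the algorithm mechanically; since Var(y) = σ² I and the w_i are orthogonal, σ² cancels from the coefficients ℓ_i:

    import numpy as np

    X = np.array([[1., 1., 1.],
                  [1., 1., 1.],
                  [1., -1., 0.],
                  [-1., 1., 0.]])
    y = np.array([5.1, 8.2, 4.9, 3.1])

    a = np.array([1., 0., 0., 0.])            # LUE of mu1+mu2+mu3: a^T y = y1
    W = np.array([[1., -1., 0., 0.],          # rows are the error functions y1-y2 and y3+y4
                  [0., 0., 1., 1.]])

    # cov(a^T y, w^T y) = sigma^2 a^T w and Var(w^T y) = sigma^2 w^T w, so sigma^2 cancels.
    l = -(W @ a) / np.sum(W * W, axis=1)
    a_star = a + W.T @ l
    print(a_star)                              # [0.5 0.5 0.  0. ]
    print(a_star @ y)                          # 6.65, the BLUE of mu1+mu2+mu3
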


Example: Dummy variable regression model

- Consider a linear model with one factor covariate having k levels A1, A2, ..., Ak.

- Suppose we work with the model

      y = α + β1 x1 + ... + β_{k−1} x_{k−1} + ε

  where x_i is the dummy variable for the i-th level of the factor.

- Suppose, for simplicity, that the y_i's are arranged so that

      y_1, y_2, ..., y_{n1} receive A1,
      y_{n1+1}, y_{n1+2}, ..., y_{n1+n2} receive A2,
      ...
      y_{n1+...+n_{k−1}+1}, ..., y_{n1+...+n_{k−1}+n_k} receive Ak.

Example (Contd.)

- A better and simpler way of thinking: we have k groups, where the observations in the i-th group receive the particular level A_i.

- From the above arrangement, there will be n_i observations in the i-th group.

- For further simplicity, let y_ij denote the j-th observation in the i-th group, j = 1, 2, ..., n_i and i = 1, 2, ..., k.

- Then the design matrix X (with k columns: the intercept column followed by the k − 1 dummy columns) has the block form

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       0     ]

  where B_i is the n_i × (k−1) block whose rows each have a 1 in the i-th dummy column and 0 elsewhere, and the rows for the k-th group have 0 in every dummy column.

- Further, recall that X is of full column rank.

- Note that here the parameter vector is θ = (α, β1, β2, ..., β_{k−1}).

- Thus any parametric function c^T θ is estimable.

- Suppose we want to estimate α.

- As a first step we start with some LUE of α. Let us take a^T y = y_{k1}, because E(y_{k1}) = α.

- Since Rank(X) = k, there will be n − k linearly independent error functions.

- From each group we get n_i − 1 linearly independent error functions, corresponding to the n_i − 1 repeated rows of the block (1_{n_i}  B_i) (and of (1_{n_k}  0) for the last group), so that we have Σ_i (n_i − 1) = n − k linearly independent error functions altogether.

- For simplicity, let us denote the error functions from the i-th group as e_ij, j = 1, 2, ..., n_i − 1.

- Then the BLUE will be

      y_{k1} + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij.

Example (Contd.)

- The coefficients ℓ_ij are determined by the restrictions

      cov( y_{k1} + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij ,  e_ij ) = 0  for every i and j.

- Now, for any group other than the k-th group, we can choose the error functions to be y_ij − y_i1, j = 2, ..., n_i.

- These error functions are linearly independent but not orthogonal.

- Since cov(y_kj, y_ij') = 0 for any i ≠ k and any j and j', the restrictions

      cov( y_{k1} + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij ,  e_ij ) = 0

  boil down to

      cov( y_{k1} + Σ_{j=1}^{n_k − 1} ℓ_kj e_kj ,  e_kj ) = 0.

Example (Contd.)

- Thus we only need to concentrate on the error functions of the k-th group, and we now want them to be mutually orthogonal.

- Even for the k-th group, we could have chosen the n_k − 1 linearly independent error functions to be y_kj − y_{k1}, j = 2, ..., n_k, but then they would not be mutually orthogonal and we would have simultaneous equations involving the ℓ_kj's, which are difficult to handle.

- Instead we directly aim to construct n_k − 1 mutually orthogonal error functions from the n_k observations y_{k1}, y_{k2}, ..., y_{k n_k}.

- This is the same as constructing an (n_k − 1) × n_k matrix with mutually orthogonal rows, each row summing to zero (which, when pre-multiplied to (y_{k1}, ..., y_{k n_k})^T, gives the error functions).

- But how do we get hold of such a matrix?

Example (Contd.)

- A good point of reference is Helmert's orthogonal matrix, whose rows are

      [ 1/√n          1/√n          1/√n        ...    1/√n
        1/√2         −1/√2          0           ...    0
        1/√6          1/√6         −2/√6        0 ...  0
        ...
        1/√(n(n−1))   1/√(n(n−1))   ...   1/√(n(n−1))   −(n−1)/√(n(n−1)) ].

- The idea is that from this n × n matrix we can simply ignore the first row and take the remaining rows, with the choice n = n_k.

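A quick numpy sketch (an illustration, not part of the slides) of the rows of the Helmert matrix below the first one, checking that they are orthonormal and that each sums to zero:

    import numpy as np

    def helmert_rows(n):
        """Rows 2..n of the n x n Helmert matrix: mutually orthogonal,
        each orthogonal to the vector of ones, so each w^T y is a zero function."""
        H = np.zeros((n - 1, n))
        for j in range(1, n):                          # j = 1, ..., n-1
            H[j - 1, :j] = 1.0 / np.sqrt(j * (j + 1))
            H[j - 1, j] = -j / np.sqrt(j * (j + 1))
        return H

    H = helmert_rows(4)
    print(np.round(H @ H.T, 10))          # identity matrix: the rows are orthonormal
    print(np.round(H @ np.ones(4), 10))   # zeros: each row sums to 0
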
Example (Contd.)

- Thus the error functions are

      e_{k1} = (1/√2) (y_{k1} − y_{k2}),
      e_{k2} = (1/√6) (y_{k1} + y_{k2} − 2 y_{k3}),

  and in general

      e_kj = (1/√(j(j+1))) ( y_{k1} + y_{k2} + ... + y_kj − j y_{k(j+1)} )
           = (1/√(j(j+1))) ( Σ_{i=1}^{j} y_{ki} − j y_{k(j+1)} ).

- Then we can get ℓ_kj as

      ℓ_kj = − cov(y_{k1}, e_kj) / Var(e_kj).

Example (Contd.)

- Clearly

      Var(e_kj) = (1/(j(j+1))) [ j σ² + j² σ² ] = σ²,

  and

      Cov(y_{k1}, e_kj) = (1/√(j(j+1))) Var(y_{k1}) = σ² / √(j(j+1)).

- Combining, we get

      ℓ_kj = − 1 / √(j(j+1)).

Example (Contd.)

- This means the BLUE is

      y_{k1} + Σ_{j=1}^{n_k − 1} ℓ_kj e_kj
        = y_{k1} − Σ_{j=1}^{n_k − 1} (1/(j(j+1))) ( Σ_{i=1}^{j} y_{ki} − j y_{k(j+1)} ).

- With a little algebraic manipulation, we can show that this is equal to

      (1/n_k) Σ_{j=1}^{n_k} y_kj,

  which is the mean of the observations receiving the k-th level of the factor.

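The telescoping identity above is easy to verify numerically; here is a small numpy check on hypothetical data for the k-th group (again just an illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    yk = rng.normal(size=6)                    # hypothetical observations of the k-th group, n_k = 6
    nk = len(yk)

    blue = yk[0]                               # start from the LUE y_{k1}
    for j in range(1, nk):                     # j = 1, ..., n_k - 1
        e_kj = (yk[:j].sum() - j * yk[j]) / np.sqrt(j * (j + 1))
        l_kj = -1.0 / np.sqrt(j * (j + 1))
        blue += l_kj * e_kj

    print(np.isclose(blue, yk.mean()))         # True: the BLUE reduces to the k-th group mean
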
Example (Contd.)

- A few points are to be noted immediately in this example:
  - Unlike the previous example, the error functions we obtained at the beginning were not mutually orthogonal. We could have tried to convert the linearly independent error functions into mutually orthogonal ones by the Gram-Schmidt orthogonalization process, but that would be too tedious; instead we took a little help from Helmert's transformation.
  - We could have started with any other LUE of α; for example, any y_kj is a LUE.
  - The same process can be applied to find the BLUE of the other parameters β_i; in fact, by a similar process we can show that

        β̂_i = (1/n_i) Σ_j y_ij − α̂,

    where (1/n_i) Σ_j y_ij is the mean of the observations belonging to the i-th group.
  - This process of finding the BLUE appears to be a little difficult (at least for this example). Generally, for this example, this is not the way we find the BLUE of the parameters; we have better ways, as we shall see below with least squares.

Example (Contd.)

- Now suppose we consider the alternative parametrization of the model and include dummy variables for all the levels, as

      y = α + β1 x1 + ... + βk xk + ε.

- Then, with our previous notation, the design matrix (now with k + 1 columns) looks like

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       B_k   ]

  where B_i is the n_i × k block whose rows each have a 1 in the i-th dummy column and 0 elsewhere.

Example (Contd.)

- Clearly we find that X is not of full column rank.

- We note that none of the parameters α, β_i, i = 1, 2, ..., k is estimable.

- However, there are still parametric functions which are estimable; for example, α + β1 is estimable, and we can think about finding its BLUE.

Method of least squares

- Hence the objective is to minimize

      (y − Xβ)^T (y − Xβ)

  with respect to β.

- Setting

      ∂/∂β [ (y − Xβ)^T (y − Xβ) ] = 0

  we get

      −2 X^T y + 2 (X^T X) β = 0.

- This implies (X^T X) β = X^T y, which we call the normal equations.

Alternative: Geometric Approach

- We need to find β such that (y − Xβ)^T (y − Xβ), that is ||y − Xβ||, is minimized.

- Now we note that y ∈ R^n and Xβ ∈ C(X).

- Then ||y − Xβ|| is minimized when Xβ is the orthogonal projection of y on C(X).

- This implies that y − Xβ is perpendicular to C(X).

- Hence y − Xβ is orthogonal to every column of X, that is,

      X^T (y − Xβ) = 0.

- This means (X^T X) β = X^T y.

Normal equations are always solvable

- Recall that a system of equations Ax = b is consistent iff Rank(A|b) = Rank(A).

- In our case the normal equations are

      (X^T X) β = X^T y.

- This system is consistent iff Rank(X^T X | X^T y) = Rank(X^T X).

- Now X^T y ∈ C(X^T) = C(X^T X).

- Hence the result.

Gauss-Markov theorem

- For the linear model y = Xβ + ε, where E(ε) = 0 and Var(ε) = σ² I, y is observed, X is known, and β, σ² are unknown, the best linear unbiased estimator (BLUE) of an estimable linear parametric function c^T β is c^T β̂, where β̂ is any solution of the normal equations (X^T X) β = X^T y, which are obtained by minimizing the quantity

      (y − Xβ)^T (y − Xβ)

  with respect to β.

- Note that the Gauss-Markov theorem works with the minimal assumptions E(ε) = 0 and Var(ε) = σ² I on the errors, which we already have in our basic setup of linear models. It does not require any distributional assumption on the errors.

- The Gauss-Markov theorem is also called the principle of substitution, since the BLUE of any estimable LPF c^T β is obtained simply by substituting any least squares solution β̂ for β, thus getting c^T β̂.

Case I: X_{n×p} is of full column rank

- Let us now discuss the case where we assume Rank(X_{n×p}) = p. Then immediately Rank(X^T X) = p, which means X^T X is non-singular. Hence the normal equations have the unique solution

      β̂ = (X^T X)^{−1} X^T y,

  which is called the least squares estimate of β.

Example: Simple Linear Regression

- Consider a simple linear regression model

      y = a + bx + ε.

- For paired data {(x_i, y_i), i = 1, 2, ..., n} we have the linear model

      y = Xβ + ε

  where y = (y1, y2, ..., yn)^T, β = (a, b)^T and

      X = [ 1  x1
            1  x2
            ...
            1  xn ].

- We note that

      X^T X = [ n       Σ x_i
                Σ x_i   Σ x_i² ]

  and hence

      (X^T X)^{−1} = 1/( n Σ x_i² − (Σ x_i)² ) [  Σ x_i²   −Σ x_i
                                                 −Σ x_i     n     ].

- Further,

      X^T y = [ Σ y_i
                Σ x_i y_i ].

- Simplifying, we get β̂ = (X^T X)^{−1} X^T y, that is, b̂ = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² and â = ȳ − b̂ x̄.

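As a small illustration (hypothetical data, not from the slides), the normal equations can be solved directly in numpy and compared with the closed form above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    X = np.column_stack([np.ones_like(x), x])       # design matrix with rows (1, x_i)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # unique solution: X has full column rank
    print(beta_hat)                                 # [a_hat, b_hat]

    # Same answer from the closed-form expressions.
    b_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a_hat = y.mean() - b_hat * x.mean()
    print(a_hat, b_hat)
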


Example: Model with single factor covariate

- Recall the linear model with one factor covariate having k levels A1, A2, ..., Ak.

- Suppose we work with the parametrization

      y = α + β1 x1 + ... + β_{k−1} x_{k−1} + ε

  where x_i is the dummy variable for the i-th level of the factor.

- Recall that, for simplicity, we defined y_ij to be the j-th observation receiving the i-th level A_i, j = 1, 2, ..., n_i and i = 1, 2, ..., k.

- We also discussed that in this model the design matrix X is of full column rank, which means every parametric function is estimable.

- Suppose we want to find the least squares estimate of α, which we now know will be the BLUE.

Example (Contd.)

- We could simply apply the formula θ̂ = (X^T X)^{−1} X^T y, but computing (X^T X)^{−1} would be a tedious job here.

- Instead we can directly attempt to minimize

      S = Σ_{i=1}^{n} (y_i − α − β1 x_{1i} − ... − β_{k−1} x_{(k−1)i})²

  with respect to the parameters.

- Now we note that, with our y_ij notation, S takes the form S = Σ_{i=1}^{k} S_i, where

      S_i = Σ_{j=1}^{n_i} (y_ij − α − β_i)²   for the observations in the i-th group, i ≠ k,

  and

      S_k = Σ_{j=1}^{n_k} (y_kj − α)²   for the observations in the k-th group.

Example (Contd.)

- Hence ∂S/∂α = Σ_{i=1}^{k} ∂S_i/∂α.

- Setting ∂S/∂α = 0 we get

      Σ_{i=1}^{k−1} Σ_{j=1}^{n_i} (y_ij − α − β_i) + Σ_{j=1}^{n_k} (y_kj − α) = 0.

- Now, similarly setting ∂S/∂β_i = 0 for each i = 1, 2, ..., k − 1, we get

      Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0.

- Subtracting these k − 1 equations from the first one, we get

      Σ_{j=1}^{n_k} (y_kj − α) = 0,

  which implies α̂ = ȳ_{k0}, the mean of the observations in the k-th group.

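This can be checked numerically; the sketch below (hypothetical group sizes and data, purely illustrative) fits the full-column-rank dummy-variable model and confirms that α̂ is the mean of the k-th group:

    import numpy as np

    rng = np.random.default_rng(1)
    k = 3
    sizes = [4, 5, 6]                                   # hypothetical group sizes n_1, ..., n_k
    groups = np.repeat(np.arange(k), sizes)
    y = np.concatenate([rng.normal(loc=m, size=n) for m, n in zip([1.0, 2.0, 3.0], sizes)])

    # Intercept plus dummies for levels 1, ..., k-1: full column rank.
    X = np.column_stack([np.ones(groups.size)] +
                        [(groups == i).astype(float) for i in range(k - 1)])
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.isclose(theta_hat[0], y[groups == k - 1].mean()))              # alpha_hat = k-th group mean
    print(np.isclose(theta_hat[0] + theta_hat[1], y[groups == 0].mean()))   # alpha_hat + beta1_hat = 1st group mean
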
Case II: X is not of full column rank

- Let us now consider the case where the design matrix X is not of full column rank.

- Then X^T X is singular.

- There are generally two ways to deal with this situation.

- Both ways have their own merits and demerits, and we shall discuss both of them soon.

Using Generalized Inverse

- The first way to deal with a singular X^T X is to use a generalized inverse.

- Recall that a generalized inverse of a matrix A is another matrix, denoted by A⁻, such that

      A A⁻ A = A.

- Thus clearly we can use a g-inverse of X^T X to get a solution of the normal equations as

      β̂ = (X^T X)⁻ X^T y.

Example (singular X^T X)

- Recall our earlier example of a linear model having one factor covariate with k levels A1, A2, ..., Ak. Now let us consider the alternative parametrization

      y = α + Σ_{i=1}^{k} β_i x_i + ε.

- Recall that if we arrange the y's suitably, then the design matrix has the block form

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       B_k   ].

Example (Contd.)

- Clearly X is not of full column rank here, and as such X^T X is singular. Let us check how the normal equations turn out in this case.

- We note that

      X^T X = [ n    n1   n2   ...   nk
                n1   n1   0    ...   0
                n2   0    n2   ...   0
                ...
                nk   0    0    ...   nk ].

Example (Contd.)

- Further,

      X^T y = [ Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij
                Σ_{j=1}^{n_1} y_1j
                Σ_{j=1}^{n_2} y_2j
                ...
                Σ_{j=1}^{n_k} y_kj              ].

Example (Contd.)

- Thus the normal equations are

      [ n    n1   n2   ...   nk ] [ α  ]     [ Σ_{i} Σ_{j} y_ij ]
      [ n1   n1   0    ...   0  ] [ β1 ]     [ Σ_{j} y_1j       ]
      [ n2   0    n2   ...   0  ] [ β2 ]  =  [ Σ_{j} y_2j       ]
      [ ...                     ] [ .. ]     [ ...              ]
      [ nk   0    0    ...   nk ] [ βk ]     [ Σ_{j} y_kj       ].

Example (Contd.)

- Clearly there can be many solutions to this system, one for each choice of (X^T X)⁻.

- One immediate solution can be found by looking at the system: choose α = 0, so that we can ignore the first row and column of X^T X to get the submatrix

      diag(n1, n2, ..., nk),

  whose inverse does exist. Then we get β̂_i = (1/n_i) Σ_{j=1}^{n_i} y_ij = ȳ_{i0}.

- Thus α̂ = 0 and β̂_i = ȳ_{i0} for all i form one solution of the normal equations.

Example (Contd.)

- The corresponding g-inverse we are taking here is

      (X^T X)⁻ = [ 0   0
                   0   D ]

  where D = diag(1/n1, 1/n2, ..., 1/nk).

- Obviously there can be other solutions corresponding to other g-inverses: can you find one such?

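For instance, a numpy sketch (hypothetical data, purely illustrative) comparing the solution from the block g-inverse above with the one from the Moore-Penrose inverse: both solve the normal equations, and both give the same fitted values Xβ̂:

    import numpy as np

    rng = np.random.default_rng(2)
    k = 3
    n_i = np.array([4, 5, 6])
    groups = np.repeat(np.arange(k), n_i)
    y = rng.normal(size=groups.size)
    n = groups.size

    # Intercept plus a dummy for every level: rank k, not k + 1.
    X = np.column_stack([np.ones(n)] + [(groups == i).astype(float) for i in range(k)])
    XtX, Xty = X.T @ X, X.T @ y

    G = np.zeros((k + 1, k + 1))              # the block g-inverse from the slide
    G[1:, 1:] = np.diag(1.0 / n_i)
    beta1 = G @ Xty                           # alpha_hat = 0, beta_i_hat = group means
    beta2 = np.linalg.pinv(XtX) @ Xty         # another solution, via the Moore-Penrose inverse

    print(np.allclose(XtX @ beta1, Xty), np.allclose(XtX @ beta2, Xty))   # both solve the normal equations
    print(np.allclose(X @ beta1, X @ beta2))                              # the fitted values coincide
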
Alternative Way

- Another way to deal with a singular X^T X is to impose identifiability constraints.

- Note that in this case we could have worked with β̂ = (X^T X)⁻ X^T y, so that c^T β̂ is the LSE (and also the BLUE) of any estimable parametric function c^T β.

- However, in that case β̂ is only a solution of the normal equations: it need not be the LSE (or BLUE) of β, and in fact some or all of the model parameters may not be estimable in this case. This causes problems in some linear models where estimating the model parameters is an important issue.

Identifiability Constraints

- Let Rank(X_{n×p}) = r < p, which means X has r linearly independent rows (equivalently, r linearly independent columns).

- Then we consider a matrix H with rows from the orthocomplement of the row space R(X), such that

      Rank( [X; H] ) = p.

- It is enough to take p − r linearly independent vectors from R(X)^⊥.

- We then impose the constraint

      Hβ = 0.

Identifiability Constraints (Contd.)

- Then we note that, for any β satisfying Hβ = 0 and the normal equations,

      [X; H]^T [X; H] β = ( X^T  H^T ) [X; H] β = (X^T X + H^T H) β = (X^T X) β = X^T y.

- Hence we get a new system of linear equations

      (X^T X + H^T H) β = X^T y.

Identifiability Constraints (Contd.)

- Further,

      Rank(X^T X + H^T H) = Rank( [X; H]^T [X; H] ) = Rank( [X; H] ) = p.

- And X^T X + H^T H is a p × p matrix.

- This means

      (X^T X + H^T H) β = X^T y

  has a unique solution.

- The unique solution is given by

      β̂ = (X^T X + H^T H)^{−1} X^T y.

- The problem with this approach is that β̂ depends on the choice of the linear constraints H, and there can be many such constraints, which means β̂ is not unique.

Example

- Recall our earlier example of a linear model having one factor covariate with k levels A1, A2, ..., Ak.

- We consider the alternative parametrization

      y = α + Σ_{i=1}^{k} β_i x_i + ε.

- Recall that if we arrange the y's suitably, then the design matrix has the block form

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       B_k   ].

Example (Contd.)

- Clearly X is not of full column rank here, and as such X^T X is singular. We have already discussed one least squares solution in this case.

- Here we can instead impose constraints to make the parameters identifiable.

- For example, one can impose Σ n_i β_i = 0, or Σ β_i = 0, or in general Σ w_i β_i = 0.

- The final least squares estimates will change depending on which constraint we use.

Working with the constraint Σ n_i β_i = 0

- Let us denote n_(1)^T = (n1, n2, ..., nk), so that n_(1)^T 1_k = 1_k^T n_(1) = Σ n_i = n.

- When the constraint Σ n_i β_i = 0 is expressed in the form Hβ = 0, the choice of H becomes (0, n1, n2, ..., nk).

- As such,

      H^T H = [ 0   0
                0   n_(1) n_(1)^T ].

- Recall that

      X^T X = [ n      n_(1)^T
                n_(1)  D       ]

  where D = diag(n1, n2, ..., nk).

- This implies

      X^T X + H^T H = [ n      n_(1)^T
                        n_(1)  D + n_(1) n_(1)^T ].

- We need to find the inverse of this matrix.

- Recall two facts from matrix algebra:

  - Assuming S to be non-singular,

        [ P  Q ]^{−1}   [ F^{−1}             −F^{−1} Q S^{−1}                  ]
        [ R  S ]      = [ −S^{−1} R F^{−1}    S^{−1} + S^{−1} R F^{−1} Q S^{−1} ]

    where F = P − Q S^{−1} R.

  - If A is non-singular,

        (A + u v^T)^{−1} = A^{−1} − ( A^{−1} u v^T A^{−1} ) / ( 1 + v^T A^{−1} u ).

- Applying the second fact, let us find the inverse of D + n_(1) n_(1)^T, because we know that D is non-singular.

- Now

      (D + n_(1) n_(1)^T)^{−1} = D^{−1} − ( D^{−1} n_(1) n_(1)^T D^{−1} ) / ( 1 + n_(1)^T D^{−1} n_(1) ).

- Now n_(1)^T D^{−1} n_(1) = Σ_{i=1}^{k} n_i² (1/n_i) = Σ_{i=1}^{k} n_i = n.

- Further,

      D^{−1} n_(1) = diag(1/n1, 1/n2, ..., 1/nk) (n1, n2, ..., nk)^T = 1_k.

- Hence D^{−1} n_(1) n_(1)^T D^{−1} = 1_k 1_k^T.

- Thus

      (D + n_(1) n_(1)^T)^{−1} = D^{−1} − (1/(n+1)) 1_k 1_k^T.

- This is S^{−1} in the general form of the first fact.

- As such, Q S^{−1} R is

      n_(1)^T (D + n_(1) n_(1)^T)^{−1} n_(1)
        = n_(1)^T ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1)
        = n_(1)^T D^{−1} n_(1) − n_(1)^T 1_k 1_k^T n_(1) / (n+1)
        = n − n²/(n+1) = n/(n+1).

- Thus F = P − Q S^{−1} R = n − n/(n+1) = n²/(n+1), and F^{−1} = (n+1)/n².

- Next we find F^{−1} Q S^{−1} as

      ((n+1)/n²) n_(1)^T (D + n_(1) n_(1)^T)^{−1}
        = ((n+1)/n²) n_(1)^T ( D^{−1} − 1_k 1_k^T/(n+1) )
        = ((n+1)/n²) 1_k^T − n_(1)^T 1_k 1_k^T / n²
        = ((n+1)/n²) 1_k^T − (1/n²) (n, n, ..., n)
        = (1/n²) 1_k^T.

- Further, R F^{−1} Q S^{−1} is

      n_(1) (1/n²) 1_k^T = (1/n²) n_(1) 1_k^T.

- This implies that S^{−1} R F^{−1} Q S^{−1} is

      (D + n_(1) n_(1)^T)^{−1} (1/n²) n_(1) 1_k^T
        = (1/n²) ( D^{−1} − 1_k 1_k^T/(n+1) ) n_(1) 1_k^T
        = (1/n²) ( 1_k 1_k^T − 1_k (1_k^T n_(1)) 1_k^T / (n+1) )
        = (1/n²) ( 1_k 1_k^T − (n/(n+1)) 1_k 1_k^T )
        = 1_k 1_k^T / ( n²(n+1) ).

- Hence S^{−1} + S^{−1} R F^{−1} Q S^{−1} is

      D^{−1} − (1/(n+1)) 1_k 1_k^T + (1/(n²(n+1))) 1_k 1_k^T
        = D^{−1} − ((n−1)/n²) 1_k 1_k^T

        = [ 1/n1 − (n−1)/n²    −(n−1)/n²           ...   −(n−1)/n²
            −(n−1)/n²           1/n2 − (n−1)/n²    ...   −(n−1)/n²
            ...
            −(n−1)/n²           −(n−1)/n²          ...    1/nk − (n−1)/n² ].

- Finally, S^{−1} R F^{−1} is

      ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1) ((n+1)/n²)
        = ((n+1)/n²) ( 1_k − (n/(n+1)) 1_k )
        = (1/n²) 1_k.

- Thus we get

      [ α̂  ]   [ (n+1)/n²   −1/n²              −1/n²              ...   −1/n²             ] [ Σ_{i} Σ_{j} y_ij ]
      [ β̂1 ]   [ −1/n²       1/n1 − (n−1)/n²   −(n−1)/n²          ...   −(n−1)/n²         ] [ Σ_{j} y_1j       ]
      [ β̂2 ] = [ −1/n²      −(n−1)/n²           1/n2 − (n−1)/n²   ...   −(n−1)/n²         ] [ Σ_{j} y_2j       ]
      [ .. ]   [ ...                                                                       ] [ ...              ]
      [ β̂k ]   [ −1/n²      −(n−1)/n²          −(n−1)/n²          ...    1/nk − (n−1)/n²  ] [ Σ_{j} y_kj       ].

- Thus we get

      α̂ = ((n+1)/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij
         = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij = ȳ_{00},

  which is the mean of all the observations, that is, the grand mean.

- Further, for each i,

      β̂_i = − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − ((n−1)/n²) Σ_{ℓ≠i} Σ_{j=1}^{n_ℓ} y_ℓj + ( 1/n_i − (n−1)/n² ) Σ_{j=1}^{n_i} y_ij
           = − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − ((n−1)/n²) Σ_{ℓ=1}^{k} Σ_{j=1}^{n_ℓ} y_ℓj + (1/n_i) Σ_{j=1}^{n_i} y_ij
           = ȳ_{i0} − ȳ_{00},

  where ȳ_{i0} is the mean of the observations of the i-th group.
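
A short numpy check of this result (hypothetical data, purely illustrative): solving (X^T X + H^T H)β = X^T y with H built from the constraint Σ n_i β_i = 0 reproduces α̂ = ȳ_00 and β̂_i = ȳ_i0 − ȳ_00:

    import numpy as np

    rng = np.random.default_rng(3)
    k = 3
    n_i = np.array([4, 5, 6])
    groups = np.repeat(np.arange(k), n_i)
    y = rng.normal(size=groups.size)
    n = groups.size

    X = np.column_stack([np.ones(n)] + [(groups == i).astype(float) for i in range(k)])
    H = np.concatenate([[0.0], n_i.astype(float)])[None, :]    # constraint: sum_i n_i beta_i = 0

    beta_hat = np.linalg.solve(X.T @ X + H.T @ H, X.T @ y)
    group_means = np.array([y[groups == i].mean() for i in range(k)])

    print(np.isclose(beta_hat[0], y.mean()))                   # alpha_hat = grand mean
    print(np.allclose(beta_hat[1:], group_means - y.mean()))   # beta_i_hat = group mean - grand mean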


Alternative Method

- We want to minimize

      S = Σ_{i=1}^{n} (y_i − α − β1 x_{1i} − β2 x_{2i} − ... − βk x_{ki})²

  with respect to α, β1, ..., βk subject to the condition Σ n_i β_i = 0.

- We understand that

      S = Σ_{i=1}^{k} S_i,   where   S_i = Σ_{j=1}^{n_i} (y_ij − α − β_i)².

- Thus, setting ∂S/∂α = Σ_i ∂S_i/∂α = 0, we get

      Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0,

  which implies α̂ = ȳ_{00}, since Σ n_i β_i = 0.

- Again, from ∂S/∂β_i = ∂S_i/∂β_i = 0, we get

      Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0,

  which implies β̂_i = ȳ_{i0} − ȳ_{00}.

What about the fitted values and residuals?

- Here the fitted values of y will be

      ŷ = X β̂ = X (X^T X)⁻ X^T y = H y (say).

- But then what are E(ŷ) and Var(ŷ)?

- We can write

      E(ŷ) = H E(y) = H X β = X β

  and

      Var(ŷ) = σ² H = σ² X (X^T X)⁻ X^T.

Fitted values and residuals (Contd.)

- We note that there may exist many g-inverses of X^T X, but for any such g-inverse H = X (X^T X)⁻ X^T is the same.

- Thus, combining the cases of non-singular and singular X^T X, we can write the unique fitted values as

      ŷ = H y,

  where H = X (X^T X)⁻ X^T, E(ŷ) = X β and Var(ŷ) = σ² H.

- Further, the residuals are given by

      e = y − ŷ = (I_n − H) y.

Estimable functions

- One question is: what do we mean when we say that β is estimable?

- Recall that we learned what is meant by estimability of a parametric function c^T β.

- The fact is that β being estimable means every component β_i is estimable.

- For each β_i, expressed as a linear parametric function c^T β, the choice of c is the i-th unit vector u_i.

- Thus every such β_i will be estimable iff all the p unit vectors u_i belong to the row space of X, which amounts to saying that X is of full column rank.

- Hence all the parameters in a linear model are estimable iff the design matrix is of full column rank.

Estimability (Contd.)

- This discussion also extends the idea of estimability, because now we can talk about linear parametric functions Cβ, where C_{k×p} is a matrix.

- This is the same as considering k linear parametric functions at a single time.

- Similarly, we can say that Cβ is estimable iff

      Rank(X) = Rank( [X; C] ).

- This explains why, in the case of a singular X^T X, we talk about the estimation of Xβ.

- Since Xβ is estimable, the least squares estimate ŷ = X β̂ is the BLUE of Xβ.

Estimation of σ²

- In linear models the error variance σ² is a nuisance parameter, but we shall still estimate it. An unbiased estimator of σ² can be constructed from the residuals.

- We note that e = (I − H) y, which implies e^T e = y^T (I − H) y. This quantity is called the residual sum of squares (RSS) or the sum of squares due to error (SSE).

- Recall that if a random vector X has mean µ and variance Σ, then

      E(X^T A X) = µ^T A µ + tr(A Σ).

- Then

      E(e^T e) = E( y^T (I − H) y ) = (Xβ)^T (I − H)(Xβ) + tr( σ² (I − H) ).

Estimation (Contd.)

I Now (X β)T (I − H)(X β) = 0 and

Tr (I − H) = Tr (I ) − Tr (H)

= n − Rank(H) = n − r
if Rank(X ) = r .
Estimation (Contd.)
I Now (X β)T (I − H)(X β) = 0 and
Tr (I − H) = Tr (I ) − Tr (H) = n − Rank(H) = n − r
if Rank(X ) = r .
I Thus e T e/(n − r ) is an unbiased estimate of σ 2 . We call it the
least square estimate of σ 2 : that does not mean we apply least
squares to get the estimator; rather, the estimator is obtained
from the least square residuals. We denote it as
σ̂ 2 LSE = SSE /(n − r ).
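To make this concrete, here is a minimal NumPy sketch (the data and all variable names are made up for illustration) that fits a small model by least squares and computes SSE and the unbiased estimate SSE /(n − r ):

```python
import numpy as np

# Hypothetical design: n = 6 observations, p = 2 parameters
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0])

n, p = X.shape
r = np.linalg.matrix_rank(X)              # here r = p = 2 (full column rank)

# Least squares fit: beta_hat solves the normal equations X'X beta = X'y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ beta_hat                       # residuals e = (I - H) y
SSE = e @ e                                # residual sum of squares
sigma2_hat = SSE / (n - r)                 # unbiased "least square" estimate of sigma^2
print(beta_hat, SSE, sigma2_hat)
```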
Maximum Likelihood Estimation
I For likelihood estimation it is mandatory that we have some
distributional assumption regarding ε. Assume ε ∼ Nn (0, σ 2 In ).
I Here β and σ 2 are unknown parameters and the likelihood
function is given by
L(β, σ 2 ) = (2πσ 2 )−n/2 exp{ −(y − X β)T (y − X β)/(2σ 2 ) }
and hence
` = ln L(β, σ 2 ) = const − (n/2) ln σ 2 − (1/(2σ 2 ))(y − X β)T (y − X β).
MLE (Contd.)
I Setting ∂`/∂β = 0 we get
(1/σ 2 ) X T (y − X β) = 0.
I Now since σ 2 > 0, we get
X T X β = X T y
which implies β̂ = β̂LSE .
I Hence
L(β, σ 2 ) ≤ L(β̂, σ 2 ) for all σ 2 > 0
with equality iff β = β̂.
MLE (Contd.)
I Let us now maximize L(β̂, σ 2 ) with respect to σ 2 .
I Setting ∂`/∂σ 2 = 0 we get
(1/σ 2 ) [ (y − X β̂)T (y − X β̂)/σ 2 − n ] = 0.
I From σ 2 > 0 we get
σ̂ 2 = (1/n)(y − X β̂)T (y − X β̂).
MLE (Contd.)
I Thus we get for all β and σ 2 > 0,
L(β̂, σ̂ 2 ) ≥ L(β̂, σ 2 ) ≥ L(β, σ 2 ).
I Hence finally we get
β̂MLE = β̂LSE
and
σ̂ 2 MLE = (1/n)(y − X β̂)T (y − X β̂) = SSE /n.
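A short NumPy sketch (with hypothetical data) contrasting the MLE SSE /n with the unbiased estimate SSE /(n − r ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta_hat is both the LSE and the MLE
SSE = np.sum((y - X @ beta_hat) ** 2)
r = np.linalg.matrix_rank(X)

sigma2_mle = SSE / n          # maximum likelihood estimate (biased downwards)
sigma2_lse = SSE / (n - r)    # unbiased least-square-residual estimate
print(sigma2_mle, sigma2_lse)
```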
General linear hypothesis
I Consider a linear model
y = X β + ε.
I A general linear hypothesis is of the form
H0 : Lβ = c
where Lq×p is of rank q.
I For testing such a hypothesis we first need a distributional
assumption ε ∼ Nn (0, σ 2 I ), which means y ∼ N(X β, σ 2 I ).
I Any linear hypothesis H0 : Lβ = c is called testable if Lβ is
estimable, that is R(L) ⊂ R(X ).
Choices of L
I Suppose we wish to test H0 : β = β0 . Then we take L = Ip and c = β0 .
I Partition β as β = (β (1) , β (2) ), where β (1) is p1 × 1, β (2) is p2 × 1 and
p1 + p2 = p. Suppose we wish to test H0 : β (2) = 0. Then we choose
Lp2 ×p = ( 0p2 ×p1 Ip2 ).
I Often we need to test H0 : β1 = β2 = ... = βp . The hypothesis can
be written as
H0 : β1 − β2 = 0, β1 − β3 = 0, . . . , β1 − βp = 0.
Here the choice of L is the (p − 1) × p matrix
1 −1  0 · · ·  0
1  0 −1 · · ·  0
.   .   .        .
1  0  0 · · · −1
I For testing a linear combination of β, H0 : `T β = m, we may choose
L = `T , and c = m is a scalar.
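As an illustration, a small NumPy sketch (p and the hypotheses are hypothetical, chosen only to show the construction) that builds the L matrices described above:

```python
import numpy as np

p = 5  # hypothetical number of parameters

# H0: beta = beta0                        ->  L = I_p
L_full = np.eye(p)

# H0: beta^(2) = 0 (last p2 components)   ->  L = [0  I_{p2}]
p2 = 2
L_subset = np.hstack([np.zeros((p2, p - p2)), np.eye(p2)])

# H0: beta_1 = beta_2 = ... = beta_p      ->  rows (1, -1, 0, ...), (1, 0, -1, ...), ...
L_equal = np.hstack([np.ones((p - 1, 1)), -np.eye(p - 1)])

print(L_subset)
print(L_equal)
```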
First fundamental theorem
I Suppose we assume ε ∼ Nn (0, σ 2 In ), that is, y ∼ Nn (X β, σ 2 In ).
Let
R02 = SSE = minβ (y − X β)T (y − X β).
Then R02 ∼ σ 2 χ2n−r where r = Rank(X ).
Second fundamental theorem
I Let y ∼ Nn (X β, σ 2 In ) where Rank(Xn×p ) = r and Lq×p be a
matrix of rank q such that R(L) ⊂ R(X ). Further let us define
R02 = SSE = minβ (y − X β)T (y − X β)
and
R12 = SSEH0 = minβ:Lβ=c (y − X β)T (y − X β)
where c is known. Then
I R02 and R12 − R02 are independently distributed.
I R02 ∼ σ 2 χ2n−r and R12 − R02 is distributed as σ 2 times a
(possibly non-central) χ2 with q degrees of freedom.
I If Lβ = c is true then R12 − R02 ∼ σ 2 χ2q and
(R12 − R02 )/q ÷ R02 /(n − r ) ∼ Fq,n−r .
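A minimal numerical sketch of the theorem's ingredients, assuming a full-column-rank X (so (X T X )−1 exists) and using the standard restricted least squares formula for the minimizer under Lβ = c; the data and the hypothesis are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, q = 40, 3, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

L = np.array([[0.0, 1.0, -1.0]])   # hypothetical H0: beta_2 = beta_3
c = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)             # full column rank, so r = p
beta_hat = XtX_inv @ X.T @ y
R0 = np.sum((y - X @ beta_hat) ** 2)         # R0^2 = SSE

# Restricted least squares under L beta = c (standard full-rank formula)
beta_H = beta_hat - XtX_inv @ L.T @ np.linalg.solve(L @ XtX_inv @ L.T, L @ beta_hat - c)
R1 = np.sum((y - X @ beta_H) ** 2)           # R1^2 = SSE under H0

r = p
F = ((R1 - R0) / q) / (R0 / (n - r))
print(F, stats.f.sf(F, q, n - r))            # F statistic and its p-value
```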
Testing H0 : Motivation 1
I We note that SSH0 is the change in SSE due to H0 , or we can interpret this as the sum of squares
due to departure from H0 .
I Hence the value of SSH0 serves as a measure for testing H0 : the larger the value of SSH0 , the
more is the departure from H0 .
I Hence a large value of SSH0 will indicate rejection of H0 : that is, a right tailed test based on
SSH0 will be appropriate for testing H0 .
I But how large?
I Here the value of SSE serves as the benchmark for judging SSH0 .
I Hence ideally the ratio SSH0/SSE should serve as the test statistic. But this ratio does not have
any standard distribution.
I Hence, in order to make a test statistic with a standard distribution we need to consider
F = MSH0/MSE which, according to the previous discussion, will have an F distribution under H0 .
Confusion with the word ANOVA
I The word ANOVA is generally used with two meanings in the
literature.
I This generally adds confusion for learners who have only recently
become acquainted with the term, and that is why, in attempts at
clarification, some even more confusing terms like
“ANOVA in Regression” appear in some illustrations.
I Here we mention at the outset that the word ANOVA will be
used under two meanings throughout our entire journey of
linear models:
I ANOVA as a model.
I ANOVA as a technique.
ANOVA as technique
I ANOVA as a technique starts with our basic understanding
regarding statistical modeling.
I Items in any collection will not all be identical in every aspect;
they always exhibit some variability with respect to any
study variable.
I A statistician’s job is to investigate the cause of the variation.
I The process of identifying the potential causes of variation is
accomplished by constructing models.
I In every statistical model studied, we believe that the causes of
variation can be classified in two categories:
I Assignable Causes
I Chance Causes.
I While we understand that chance causes are uncontrollable
and represent our ignorance or reluctance in a statistical
model, it may seem that they are useless. But the truth is,
these apparently useless quantities provide us a benchmark for
understanding how significant the assignable causes are; that
is, the errors themselves are useless but they help us to
understand the importance of the other causes in the model.
ANOVA as technique
I When we have some potential factors or covariates affecting
the response variable, ANOVA splits the total sum of squares
into component sums of squares which reflect the amount of
variation in the response due to each individual factor or
covariate.
I This technique of splitting the total sum of squares is called
analysis of variance.
I The remaining idea is that we can then compare each individual
sum of squares to the sum of squares due to the chance causes
to understand how serious the effect of that particular
covariate is.
I The amount of variation due to chance causes represents the
allowable extent of variation, and any variation larger than that
allowable value will force us to think that the cause of the
variation is serious, or in other words, that the factor or covariate
has a serious impact on the response.
ANOVA as technique
I A natural question is how large that component sum of squares
should be to make us think the corresponding effect is
serious. We could simply compare the ratio with 1, and if
it comes out to be more than 1 we may take that covariate
effect to be significant.
I But we are working with samples and we need to allow for the
sampling variability of that ratio.
I Hence to obtain a threshold value we need to study the
sampling distribution of the ratio.
I If we assume normality of the response values, then generally
the sum of squares terms follow a χ2 distribution.
I Hence the ratio, after adjustments for degrees of freedom, will
have an F distribution with appropriate degrees of freedom.
I Thus ANOVA as a technique compares the variation in the
form of an F test.
Motivation 2
I It should be intuitive that SSE represents variation due to random causes
which we consider to be allowable.
I This means if there is variation in the response variable y up to the
amount of SSE we consider that variation to be beyond our control and
hence allowable.
I The ANOVA approach tests for the significance of any hypothesis by
comparing the sums of squares.
I Thus to test whether H0 is significant we need to compare SSH0 with
SSE .
I This is comparing the variation in y caused due to H0 with our
benchmark SSE .
I If SSH0 comes out significantly larger than SSE , we conclude that the
variation caused due to H0 is significant and non-ignorable.
I Hence we shall compute the ratio (SSH0 /df1 ) ÷ (SSE /df2 ) and this
should be our test statistic for testing H0 .
Motivation 3
I For testing H0 : Lβ = c, a natural statistic is Lβ̂ − c.
I H0 will be rejected if Lβ̂ is sufficiently far away from c.
I However, not every element in Lβ̂ should be treated the same,
as they have different precisions.
I A suitable distance measure that incorporates the precision of
each component of Lβ̂ is the quadratic form
(Lβ̂ − c)T [Var (Lβ̂)]−1 (Lβ̂ − c)
where Var (Lβ̂) = LVar (β̂)LT = σ 2 L(X T X )− LT .
I If we estimate σ 2 by its unbiased estimate σ̂ 2 = SSE /(n − r ), we arrive
at
(Lβ̂ − c)T [L(X T X )− LT ]−1 (Lβ̂ − c) / σ̂ 2 .
I Our test statistic will be a constant times this quadratic
measure.
F-test for general linear hypothesis
I Consider a linear model
yn×1 = Xn×p βp×1 + εn×1
where ε ∼ Nn (0, σ 2 In ) and Rank(X ) = r .
I Consider testing a hypothesis H0 : Lβ = c where Lq×p is of rank q and c
is known.
I The hypothesis H0 is called testable iff R(L) ⊂ R(X ).
I A test statistic for testing H0 is given by
F = (R12 − R02 )/q ÷ R02 /(n − r )
which under H0 has the Fq,n−r distribution.
I Here
R02 = minβ (y − X β)T (y − X β)
and
R12 = minβ:Lβ=c (y − X β)T (y − X β).
I In our new notation the test statistic can be written as
F = (SSH0 /q) ÷ (SSE /(n − r )) = MSH0 /MSE ∼ Fq,n−r under H0 .
I The test statistic can also be expressed in an alternative form
as
F = (Lβ̂ − c)T [L(X T X )− LT ]−1 (Lβ̂ − c) / (qσ̂ 2 )
or sometimes even in a more convenient form as
F = (Lβ̂ − c)T [ Var (Lβ̂)/σ 2 ]−1 (Lβ̂ − c) / (qσ̂ 2 ).
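A sketch of the quadratic-form version of the F statistic, assuming a full-column-rank design so that (X T X )−1 can be used in place of a generalized inverse; the data, L and c are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # full column rank, r = p
y = X @ np.array([2.0, 1.0, 1.0]) + rng.normal(size=n)

L = np.array([[0.0, 1.0, -1.0]])    # hypothetical H0: beta_2 = beta_3
c = np.array([0.0])
q, r = L.shape[0], p

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - r)        # MSE

diff = L @ beta_hat - c
F = diff @ np.linalg.solve(L @ XtX_inv @ L.T, diff) / (q * sigma2_hat)
print(F, stats.f.sf(F, q, n - r))
```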
Testing a linear combination
I One particular choice of L needs special mention.
I Suppose we want to test any linear combination of β, that is,
H0 : `T β = c.
I We shall assume ` ∈ R(X ), so that H0 is testable.
I Then what we find is that the test statistic simplifies to
F = (`T β̂ − c)2 / { [Var (`T β̂)/σ 2 ] σ̂ 2 } ∼ F1,n−r under H0 .
I Since F1,n−r = t2n−r , we can also perform a t-test using the
test statistic
t = |`T β̂ − c| / { σ̂ [Var (`T β̂)/σ 2 ]1/2 } ∼ tn−r under H0 .
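A small sketch of the corresponding t-test for a single linear combination (hypothetical ` and c, full-rank design assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=n)

ell = np.array([0.0, 1.0, 1.0])   # hypothetical H0: beta_2 + beta_3 = c
c = 0.5

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = p
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - r))

t = (ell @ beta_hat - c) / (sigma_hat * np.sqrt(ell @ XtX_inv @ ell))
p_value = 2 * stats.t.sf(abs(t), n - r)      # two-sided, matching the F-test
print(t, p_value, t ** 2)                    # t**2 equals the F statistic
```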
Example
I Consider the linear model
y1 = α1 + ε1
y2 = 2α1 − α2 + ε2
y3 = α1 + 2α2 + ε3
where ε ∼ N3 (0, σ 2 I3 ).
I Suppose we want to test H0 : α1 = α2 .
I First we note that the design matrix is
      1  0
X =   2 −1
      1  2
which is of rank 2.
I Also the hypothesis can be written as
H0 : Lβ = 0
where L = (1, −1) and β = (α1 , α2 )T , with Rank(L) = q = 1.
I We note that L ∈ R(X ) (because X is of full column rank)
which means H0 is testable.
I Now
X T X = diag(6, 5)
and
X T y = (y1 + 2y2 + y3 , −y2 + 2y3 )T .
I Thus
β̂ = (α̂1 , α̂2 )T = (X T X )−1 X T y , that is,
α̂1 = (y1 + 2y2 + y3 )/6 and α̂2 = (2y3 − y2 )/5.
I Then
SSE = (y − X β̂)T (y − X β̂) = y T y − β̂ T X T X β̂
= Σ yi2 − 6α̂12 − 5α̂22 .
I Now we demonstrate two methods of finding the F-statistic.
Method I
I We note that Lβ̂ = α̂1 − α̂2 .
I Also
L(X T X )−1 LT = (1, −1) diag(1/6, 1/5) (1, −1)T = 1/6 + 1/5 = 11/30.
I Thus the F -statistic is given by
F = (α̂1 − α̂2 )2 / ( (11/30) σ̂ 2 ) = (α̂1 − α̂2 )2 / ( (11/30) SSE ) ∼ F1,1 ,
since here σ̂ 2 = SSE /(n − r ) = SSE with n = 3 and r = 2.
Method II
I We shall find SSEH0 , for which we shall require the least square estimate
β̂H under the restriction α1 = α2 .
I Instead of applying the general formula we can directly minimize εT ε
under the restriction α1 = α2 = α (say).
I Thus setting ∂/∂α [ (y1 − α)2 + (y2 − α)2 + (y3 − 3α)2 ] = 0 we get
α̂H = (y1 + y2 + 3y3 )/11.
I Hence
SSEH0 = (y1 − α̂H )2 + (y2 − α̂H )2 + (y3 − 3α̂H )2 .
I The F statistic is given by
F = (SSEH0 − SSE )/SSE .
I That the two F statistics are algebraically the same is a simple verification
left for homework.
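The agreement can also be checked numerically. A sketch with made-up values of y1 , y2 , y3 , computing the F statistic both ways:

```python
import numpy as np

# Hypothetical observations for the three-equation example above
y = np.array([1.2, 0.7, 2.9])
X = np.array([[1.0, 0.0], [2.0, -1.0], [1.0, 2.0]])
L = np.array([[1.0, -1.0]])

XtX_inv = np.linalg.inv(X.T @ X)              # = diag(1/6, 1/5)
beta_hat = XtX_inv @ X.T @ y
SSE = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = SSE / (3 - 2)                    # n - r = 1

# Method I: quadratic-form version of the F statistic
d = L @ beta_hat
F1 = float(d @ np.linalg.solve(L @ XtX_inv @ L.T, d) / sigma2_hat)

# Method II: refit under alpha1 = alpha2 = alpha and compare SSEs
x_restricted = np.array([1.0, 1.0, 3.0])
alpha_H = (x_restricted @ y) / (x_restricted @ x_restricted)   # = (y1 + y2 + 3 y3)/11
SSE_H0 = np.sum((y - alpha_H * x_restricted) ** 2)
F2 = (SSE_H0 - SSE) / SSE

print(F1, F2)   # the two values coincide
```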
Example: Comparison of means of two populations
I Recall the example where we need to judge the effectiveness of
a treatment (or maybe compare the effectiveness of two
treatments, in which case the control group may be thought of
as getting some treatment) in either a controlled experiment or
an observational study.
I Suppose the data are obtained in the form
Control (Treatment A)    Treatment (Treatment B)
y11                      y21
y12                      y22
...                      ...
y1n1                     y2n2
I One way of formulating this problem is to assume that we have random
samples from two independent populations : one population getting
Treatment A (or the control group) and the other population getting
Treatment B (or the treatment group).
I We want to compare the means of the two populations, that is we want to
test H0 : µ1 = µ2 against possible alternatives.
I Assuming normality and equal variance (that is, homoscedasticity), which
means we are assuming
y1j ∼ N(µ1 , σ 2 ) i.i.d., j = 1, 2, ..., n1 , and y2j ∼ N(µ2 , σ 2 ) i.i.d., j = 1, 2, ..., n2 ,
this reduces to Fisher’s t-test.
I The test statistic is given by
t = (ȳ1 − ȳ2 ) / ( s (1/n1 + 1/n2 )1/2 )
which under H0 has a tn1 +n2 −2 distribution.
I Here s 2 is the pooled variance
s 2 = [ (n1 − 1)s12 + (n2 − 1)s22 ] / (n1 + n2 − 2)
where (ni − 1)si2 = Σj (yij − ȳi )2 .
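A quick sketch of this pooled t-test with made-up samples, checking the hand computation against scipy.stats.ttest_ind:

```python
import numpy as np
from scipy import stats

# Made-up samples for the two groups (purely illustrative)
y1 = np.array([5.1, 4.8, 5.6, 5.0, 4.9])          # control
y2 = np.array([5.9, 6.1, 5.7, 6.3, 5.8, 6.0])     # treatment
n1, n2 = len(y1), len(y2)

s2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)
t = (y1.mean() - y2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
p_two_sided = 2 * stats.t.sf(abs(t), n1 + n2 - 2)

print(t, p_two_sided)
print(stats.ttest_ind(y1, y2, equal_var=True))    # should agree with the manual computation
```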
I Another way to work with this problem is to use a linear model
by introducing dummy variables.
I Then the linear model can be written as
z = α + βx + ε
or more precisely
zi = α + βxi + εi , i = 1, 2, ..., n(= n1 + n2 ),
where E (ε) = 0 and D(ε) = σ 2 In .
I Here xi is a dummy variable taking the value 0 for the control group
and 1 for the treatment group,
zi = y1i for i = 1, 2, ..., n1 , zi = y2(i−n1 ) for i = n1 + 1, ..., n1 + n2 ,
and ε1 , ε2 , ..., εn are the random errors.
I If we compare the two formulations we note that in the case of the
linear model we are assuming
E (y1i ) = α and E (y2i ) = α + β
and hence µ1 = α and µ2 = α + β.
I Thus testing H0 : µ1 = µ2 is equivalent to testing H0′ : β = 0.
I Again if we assume the errors ε to be normal then what we get
is
y1i ∼ N(µ1 = α, σ 2 ) and y2i ∼ N(µ2 = α + β, σ 2 ).
I So from the equivalence of the two setups and the equivalence
of the two hypotheses, there must be some equivalence in the
two test statistics.
I Let us obtain the test statistic for testing the linear hypothesis
H0′ : β = 0 against H1 : β ≠ 0 in the case of the linear model.
I First we note that for the hypothesis H0 : β = 0, when expressed
as the linear hypothesis H0 : Lθ = 0, the choice of L becomes
L = `T = (0, 1).
I Now ` ∈ R(X ) and as such H0 : β = 0 is testable and
Rank(L) = 1.
I Then the test statistic is given by
F = (`T θ̂)T [`T (X T X )−1 `]−1 (`T θ̂) / (qS 2 )
where θ̂ = (X T X )−1 X T y is the least square estimate of θ = (α, β)T .
I Now in this case the design matrix is Xn×2 = [1 x ], whose first
column is a column of 1’s and whose second column x has 0 in its
first n1 positions and 1 in its last n2 positions. This implies
X T X = [ n n2 ; n2 n2 ] (rows separated by semicolons)
and
(X T X )−1 (X T y ) = [ n n2 ; n2 n2 ]−1 (Σy1i + Σy2i , Σy2i )T
= (1/(n1 n2 )) [ n2 −n2 ; −n2 n ] (Σy1i + Σy2i , Σy2i )T .
I This implies
θ̂ = (α̂, β̂)T = (ȳ10 , ȳ20 − ȳ10 )T .
I As such `T θ̂ = β̂ = ȳ20 − ȳ10 .
I Also
`T (X T X )−1 ` = (0, 1) (1/(n1 n2 )) [ n2 −n2 ; −n2 n ] (0, 1)T = n/(n1 n2 )
= (n1 + n2 )/(n1 n2 ) = 1/n1 + 1/n2 .
I Moreover
SSE = (y − X θ̂)T (y − X θ̂) = y T y − θ̂T X T X θ̂
= Σ y1i2 + Σ y2i2 − (ȳ10 , ȳ20 − ȳ10 ) [ n n2 ; n2 n2 ] (ȳ10 , ȳ20 − ȳ10 )T
= Σ y1i2 + Σ y2i2 − [ n(ȳ10 )2 + n2 (ȳ20 − ȳ10 )2 + 2n2 ȳ10 (ȳ20 − ȳ10 ) ]
= Σ y1i2 + Σ y2i2 − [ n1 (ȳ10 )2 + n2 (ȳ20 )2 ]
= Σ (y1i − ȳ10 )2 + Σ (y2i − ȳ20 )2 = (n1 − 1)S12 + (n2 − 1)S22 .
I Hence
σ̂ 2 = MSE = SSE /(n − r ) = SSE /(n − 2)
= [ (n1 − 1)S12 + (n2 − 1)S22 ] / (n1 + n2 − 2) = S 2 .
I Thus the test statistic is
F = (ȳ10 − ȳ20 )2 / [ S 2 (1/n1 + 1/n2 ) ]
which under H0 has the F1,n1 +n2 −2 distribution.
I Hence what we find is the equivalence in test statistics between the two
approaches and their corresponding tests (critical regions).
I We note that the relation between the two test statistics is F = t 2 and
their corresponding level α critical regions are
ℜ1 = {y : |t| > tn1 +n2 −2;α/2 } and ℜ2 = {y : F > F1,n1 +n2 −2;α }.
I Since F1,n1 +n2 −2 = t2n1 +n2 −2 (so that F1,n1 +n2 −2;α = t2n1 +n2 −2;α/2 ), both
critical regions are exactly the same.
I However there is a little difference between the two approaches of comparison.
I Note that the former Fisher’s t-test approach allows the alternative
hypothesis to be one-sided, that is we can choose the alternative to be
H1 : µ1 > µ2 or H1 : µ1 < µ2 .
I But the F test in the linear hypothesis approach allows only the
two-sided alternative H1 : β ≠ 0.
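A numerical sketch of this equivalence with made-up data: the pooled t statistic squared coincides with the regression F statistic for H0 : β = 0, and the corresponding p-values agree:

```python
import numpy as np
from scipy import stats

# Made-up two-group data (illustrative only)
y1 = np.array([4.9, 5.3, 5.1, 4.7, 5.2])
y2 = np.array([5.8, 6.2, 5.9, 6.1])
n1, n2 = len(y1), len(y2)
n = n1 + n2

# Pooled two-sample t statistic
s2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n - 2)
t = (y1.mean() - y2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))

# Dummy-variable regression z = alpha + beta x + error, F test of beta = 0
z = np.concatenate([y1, y2])
x = np.concatenate([np.zeros(n1), np.ones(n2)])
X = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
SSE = np.sum((z - X @ theta_hat) ** 2)
sigma2_hat = SSE / (n - 2)
ell = np.array([0.0, 1.0])
F = (ell @ theta_hat) ** 2 / (sigma2_hat * ell @ np.linalg.inv(X.T @ X) @ ell)

print(t ** 2, F)                                               # identical values: F = t^2
print(stats.f.sf(F, 1, n - 2), 2 * stats.t.sf(abs(t), n - 2))  # identical p-values
```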
