
Review of Linear Models II

Presidency University

February, 2025
Estimation

- A parametric function (or linear parametric function) is a linear combination of the parameters,

      c^T β = Σ_{i=1}^{p} c_i β_i.

  A parametric function c^T β is called estimable if there exists an a ∈ R^n such that E(a^T y) = c^T β, that is, if there exists a linear unbiased estimator (LUE) of c^T β.

- A parametric function c^T β is estimable iff c^T ∈ R(X) (equivalently, c ∈ C(X^T), or c ∈ C(X^T X)).

Two Corollaries

- Any estimable parametric function is of the form a^T X β for some a ∈ R^n.

- If Rank(X_{n×p}) = p, then all parametric functions are estimable.

- Indeed, writing [X; c^T] for X with the row c^T appended, we have

      p = Rank(X_{n×p}) ≤ Rank([X; c^T]_{(n+1)×p}) ≤ p

  ⇒ Rank(X) = Rank([X; c^T]_{(n+1)×p}) whatever c^T may be.

- This implies that any parametric function c^T β is estimable.



An important result

- Let c^T β be estimable and let a^T y be any LUE of c^T β. Then there exists an a* ∈ C(X) such that
  - a*^T y is also a LUE of c^T β,
  - Var(a*^T y) is minimum among all LUEs,
  - the choice of a* is unique.

- In other words, for an estimable parametric function c^T β, among the class of all linear unbiased estimators there exists one with minimum variance, and the choice of this minimum-variance linear unbiased estimator is unique.

- Summarizing: for any estimable parametric function c^T β there exists a unique minimum-variance linear unbiased estimator. It is called the Best Linear Unbiased Estimator (BLUE) of c^T β.

Estimation space and Error space

- The column space C(X) of the design matrix X is called the estimation space. Further, a linear function w^T y is called an error function or linear zero function (LZF) if it is an unbiased estimator of 0, that is, if

      E(w^T y) = 0 for all β.

  The vector space spanned by the error functions is called the error space.

- It can be shown that the error space is C(X)^⊥, the orthocomplement of C(X) in the n-dimensional vector space R^n.

- Thus, for a linear model with design matrix X_{n×p}, if Rank(X) = r (≤ p), then there exist n − r linearly independent choices of w ∈ C(X)^⊥, and hence n − r linearly independent error functions.

Algorithm: find the BLUE from a LUE

- We are interested in an estimable c^T β.

- First find the rank of X_{n×p}. If X is of rank r, then we have n − r linearly independent error functions.

- Then find any LUE a^T y of c^T β.

- Then find n − r linearly independent linear zero functions, say e_1, e_2, ..., e_{n−r}, where e_i = w_i^T y.

- Then write the BLUE as

      a*^T y = a^T y + Σ_i ℓ_i w_i^T y.

- Find each ℓ_i by imposing the orthogonality restriction

      cov(a*^T y, w_i^T y) = 0.

- If, further, the error functions are mutually orthogonal, then for all i

      ℓ_i = − cov(a^T y, w_i^T y) / Var(w_i^T y).

Example

- Consider the linear model

      μ1 + μ2 + μ3 + ε1 = 5.1
      μ1 + μ2 + μ3 + ε2 = 8.2
      μ1 − μ2 + ε3 = 4.9
      −μ1 + μ2 + ε4 = 3.1

- Thus the design matrix here is

      X = [  1   1   1
             1   1   1
             1  −1   0
            −1   1   0 ].

Example

- μ2 is not estimable: for c^T = (0, 1, 0),

      Rank([X; c^T]) ≠ Rank(X) = 2,

  but μ1 + μ2 + μ3 is estimable.

- A LUE of μ1 + μ2 + μ3 is y1 = a^T y with a^T = (1, 0, 0, 0). Next we try to find the BLUE of μ1 + μ2 + μ3.

- We need to find the linearly independent error functions. Since Rank(X) = 2, there will be 4 − 2 = 2 linearly independent error functions.

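These rank conditions are easy to check numerically. Below is a minimal numpy sketch (an illustration, not part of the original slides) applying the criterion Rank([X; c^T]) = Rank(X) to the design matrix above:

    import numpy as np

    X = np.array([[1., 1., 1.],
                  [1., 1., 1.],
                  [1., -1., 0.],
                  [-1., 1., 0.]])

    def estimable(c, X):
        # c^T beta is estimable iff appending the row c^T does not increase the rank of X
        return np.linalg.matrix_rank(np.vstack([X, c])) == np.linalg.matrix_rank(X)

    print(np.linalg.matrix_rank(X))                 # 2
    print(estimable(np.array([0., 1., 0.]), X))     # False: mu2 is not estimable
    print(estimable(np.array([1., 1., 1.]), X))     # True:  mu1 + mu2 + mu3 is estimable
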
Example

- We have E(y1) = μ1 + μ2 + μ3 and E(y2) = μ1 + μ2 + μ3, whereas E(y3) = μ1 − μ2 and E(y4) = −μ1 + μ2.

- This implies E(y1 − y2) = 0 and E(y3 + y4) = 0. Thus two linearly independent error functions are y1 − y2 = w1^T y and y3 + y4 = w2^T y.

- The BLUE can be written as

      a*^T y = a^T y + ℓ1 w1^T y + ℓ2 w2^T y.

- Since here the error functions are mutually orthogonal, we get ℓ_i = − cov(a^T y, w_i^T y) / Var(w_i^T y), and hence

      ℓ1 = − cov(y1, y1 − y2) / Var(y1 − y2) = −1/2   and   ℓ2 = − cov(y1, y3 + y4) / Var(y3 + y4) = 0.

- Thus the BLUE is y1 − (1/2)(y1 − y2) = (1/2)(y1 + y2) = 6.65.

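As an aside, a small numpy check on the same data (not part of the slides) carries out the algorithm mechanically; since Var(y) = σ² I and the w_i are orthogonal, σ² cancels from the coefficients ℓ_i:

    import numpy as np

    X = np.array([[1., 1., 1.],
                  [1., 1., 1.],
                  [1., -1., 0.],
                  [-1., 1., 0.]])
    y = np.array([5.1, 8.2, 4.9, 3.1])

    a = np.array([1., 0., 0., 0.])            # LUE of mu1+mu2+mu3: a^T y = y1
    W = np.array([[1., -1., 0., 0.],          # rows are the error functions y1-y2 and y3+y4
                  [0., 0., 1., 1.]])

    # cov(a^T y, w^T y) = sigma^2 a^T w and Var(w^T y) = sigma^2 w^T w, so sigma^2 cancels.
    l = -(W @ a) / np.sum(W * W, axis=1)
    a_star = a + W.T @ l
    print(a_star)                              # [0.5 0.5 0.  0. ]
    print(a_star @ y)                          # 6.65, the BLUE of mu1+mu2+mu3
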


Example: Dummy variable regression model

- Consider a linear model with one factor covariate having k levels A1, A2, ..., Ak.

- Suppose we work with the model

      y = α + β1 x1 + ... + β_{k−1} x_{k−1} + ε

  where x_i is the dummy variable for the i-th level of the factor.

- Suppose, for simplicity, that the y_i's are arranged so that

      y_1, y_2, ..., y_{n1} receive A1,
      y_{n1+1}, y_{n1+2}, ..., y_{n1+n2} receive A2,
      ...
      y_{n1+...+n_{k−1}+1}, ..., y_{n1+...+n_{k−1}+n_k} receive Ak.

Example (Contd.)

- A better and simpler way of thinking: we have k groups, where the observations in the i-th group receive the particular level A_i.

- From the above arrangement, there will be n_i observations in the i-th group.

- For further simplicity, let y_ij denote the j-th observation in the i-th group, j = 1, 2, ..., n_i and i = 1, 2, ..., k.

- Then the design matrix X (with k columns: the intercept column followed by the k − 1 dummy columns) has the block form

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       0     ]

  where B_i is the n_i × (k−1) block whose rows each have a 1 in the i-th dummy column and 0 elsewhere, and the rows for the k-th group have 0 in every dummy column.

- Further, recall that X is of full column rank.

- Note that here the parameter vector is θ = (α, β1, β2, ..., β_{k−1}).

- Thus any parametric function c^T θ is estimable.

- Suppose we want to estimate α.

- As a first step we start with some LUE of α. Let us take a^T y = y_{k1}, because E(y_{k1}) = α.

- Since Rank(X) = k, there will be n − k linearly independent error functions.

- From each group we get n_i − 1 linearly independent error functions, corresponding to the n_i − 1 repeated rows of the block (1_{n_i}  B_i) (and of (1_{n_k}  0) for the last group), so that we have Σ_i (n_i − 1) = n − k linearly independent error functions altogether.

- For simplicity, let us denote the error functions from the i-th group as e_ij, j = 1, 2, ..., n_i − 1.

- Then the BLUE will be

      y_{k1} + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij.

Example (Contd.)

- The coefficients ℓ_ij are determined by the restrictions

      cov( y_{k1} + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij ,  e_ij ) = 0  for every i and j.

- Now, for any group other than the k-th group, we can choose the error functions to be y_ij − y_i1, j = 2, ..., n_i.

- These error functions are linearly independent but not orthogonal.

- Since cov(y_kj, y_ij') = 0 for any i ≠ k and any j and j', the restrictions

      cov( y_{k1} + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij ,  e_ij ) = 0

  boil down to

      cov( y_{k1} + Σ_{j=1}^{n_k − 1} ℓ_kj e_kj ,  e_kj ) = 0.

Example (Contd.)

- Thus we only need to concentrate on the error functions of the k-th group, and we now want them to be mutually orthogonal.

- Even for the k-th group, we could have chosen the n_k − 1 linearly independent error functions to be y_kj − y_{k1}, j = 2, ..., n_k, but then they would not be mutually orthogonal and we would have simultaneous equations involving the ℓ_kj's, which are difficult to handle.

- Instead we directly aim to construct n_k − 1 mutually orthogonal error functions from the n_k observations y_{k1}, y_{k2}, ..., y_{k n_k}.

- This is the same as constructing an (n_k − 1) × n_k matrix with mutually orthogonal rows, each row summing to zero (which, when pre-multiplied to (y_{k1}, ..., y_{k n_k})^T, gives the error functions).

- But how do we get hold of such a matrix?

Example (Contd.)

- A good point of reference is Helmert's orthogonal matrix, whose rows are

      [ 1/√n          1/√n          1/√n        ...    1/√n
        1/√2         −1/√2          0           ...    0
        1/√6          1/√6         −2/√6        0 ...  0
        ...
        1/√(n(n−1))   1/√(n(n−1))   ...   1/√(n(n−1))   −(n−1)/√(n(n−1)) ].

- The idea is that from this n × n matrix we can simply ignore the first row and take the remaining rows, with the choice n = n_k.

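A quick numpy sketch (an illustration, not part of the slides) of the rows of the Helmert matrix below the first one, checking that they are orthonormal and that each sums to zero:

    import numpy as np

    def helmert_rows(n):
        """Rows 2..n of the n x n Helmert matrix: mutually orthogonal,
        each orthogonal to the vector of ones, so each w^T y is a zero function."""
        H = np.zeros((n - 1, n))
        for j in range(1, n):                          # j = 1, ..., n-1
            H[j - 1, :j] = 1.0 / np.sqrt(j * (j + 1))
            H[j - 1, j] = -j / np.sqrt(j * (j + 1))
        return H

    H = helmert_rows(4)
    print(np.round(H @ H.T, 10))          # identity matrix: the rows are orthonormal
    print(np.round(H @ np.ones(4), 10))   # zeros: each row sums to 0
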
Example (Contd.)

- Thus the error functions are

      e_{k1} = (1/√2) (y_{k1} − y_{k2}),
      e_{k2} = (1/√6) (y_{k1} + y_{k2} − 2 y_{k3}),

  and in general

      e_kj = (1/√(j(j+1))) ( y_{k1} + y_{k2} + ... + y_kj − j y_{k(j+1)} )
           = (1/√(j(j+1))) ( Σ_{i=1}^{j} y_{ki} − j y_{k(j+1)} ).

- Then we can get ℓ_kj as

      ℓ_kj = − cov(y_{k1}, e_kj) / Var(e_kj).

Example (Contd.)

- Clearly

      Var(e_kj) = (1/(j(j+1))) [ j σ² + j² σ² ] = σ²,

  and

      Cov(y_{k1}, e_kj) = (1/√(j(j+1))) Var(y_{k1}) = σ² / √(j(j+1)).

- Combining, we get

      ℓ_kj = − 1 / √(j(j+1)).

Example (Contd.)

- This means the BLUE is

      y_{k1} + Σ_{j=1}^{n_k − 1} ℓ_kj e_kj
        = y_{k1} − Σ_{j=1}^{n_k − 1} (1/(j(j+1))) ( Σ_{i=1}^{j} y_{ki} − j y_{k(j+1)} ).

- With a little algebraic manipulation, we can show that this is equal to

      (1/n_k) Σ_{j=1}^{n_k} y_kj,

  which is the mean of the observations receiving the k-th level of the factor.

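The telescoping identity above is easy to verify numerically; here is a small numpy check on hypothetical data for the k-th group (again just an illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    yk = rng.normal(size=6)                    # hypothetical observations of the k-th group, n_k = 6
    nk = len(yk)

    blue = yk[0]                               # start from the LUE y_{k1}
    for j in range(1, nk):                     # j = 1, ..., n_k - 1
        e_kj = (yk[:j].sum() - j * yk[j]) / np.sqrt(j * (j + 1))
        l_kj = -1.0 / np.sqrt(j * (j + 1))
        blue += l_kj * e_kj

    print(np.isclose(blue, yk.mean()))         # True: the BLUE reduces to the k-th group mean
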
Example (Contd.)

- A few points are to be noted immediately in this example:
  - Unlike the previous example, the error functions we obtained at the beginning were not mutually orthogonal. We could have tried to convert the linearly independent error functions into mutually orthogonal ones by the Gram-Schmidt orthogonalization process, but that would be too tedious; instead we took a little help from Helmert's transformation.
  - We could have started with any other LUE of α; for example, any y_kj is a LUE.
  - The same process can be applied to find the BLUE of the other parameters β_i; in fact, by a similar process we can show that

        β̂_i = (1/n_i) Σ_j y_ij − α̂,

    where (1/n_i) Σ_j y_ij is the mean of the observations belonging to the i-th group.
  - This process of finding the BLUE appears to be a little difficult (at least for this example). Generally, for this example, this is not the way we find the BLUE of the parameters; we have better ways, as we shall see below with least squares.

Example (Contd.)

- Now suppose we consider the alternative parametrization of the model and include dummy variables for all the levels, as

      y = α + β1 x1 + ... + βk xk + ε.

- Then, with our previous notation, the design matrix (now with k + 1 columns) looks like

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       B_k   ]

  where B_i is the n_i × k block whose rows each have a 1 in the i-th dummy column and 0 elsewhere.

Example (Contd.)

- Clearly we find that X is not of full column rank.

- We note that none of the parameters α, β_i, i = 1, 2, ..., k is estimable.

- However, there are still parametric functions which are estimable; for example, α + β1 is estimable, and we can think about finding its BLUE.

Method of least squares

- Hence the objective is to minimize

      (y − Xβ)^T (y − Xβ)

  with respect to β.

- Setting

      ∂/∂β [ (y − Xβ)^T (y − Xβ) ] = 0

  we get

      −2 X^T y + 2 (X^T X) β = 0.

- This implies (X^T X) β = X^T y, which we call the normal equations.

Alternative: Geometric Approach

- We need to find β such that (y − Xβ)^T (y − Xβ), that is ||y − Xβ||, is minimized.

- Now we note that y ∈ R^n and Xβ ∈ C(X).

- Then ||y − Xβ|| is minimized when Xβ is the orthogonal projection of y on C(X).

- This implies that y − Xβ is perpendicular to C(X).

- Hence y − Xβ is orthogonal to every column of X, that is,

      X^T (y − Xβ) = 0.

- This means (X^T X) β = X^T y.

Normal equations are always solvable

- Recall that a system of equations Ax = b is consistent iff Rank(A|b) = Rank(A).

- In our case the normal equations are

      (X^T X) β = X^T y.

- This system is consistent iff Rank(X^T X | X^T y) = Rank(X^T X).

- Now X^T y ∈ C(X^T) = C(X^T X).

- Hence the result.

Gauss-Markov theorem

- For the linear model y = Xβ + ε, where E(ε) = 0 and Var(ε) = σ² I, y is observed, X is known, and β, σ² are unknown, the best linear unbiased estimator (BLUE) of an estimable linear parametric function c^T β is c^T β̂, where β̂ is any solution of the normal equations (X^T X) β = X^T y, which are obtained by minimizing the quantity

      (y − Xβ)^T (y − Xβ)

  with respect to β.

- Note that the Gauss-Markov theorem works with the minimal assumptions E(ε) = 0 and Var(ε) = σ² I on the errors, which we already have in our basic setup of linear models. It does not require any distributional assumption on the errors.

- The Gauss-Markov theorem is also called the principle of substitution, since the BLUE of any estimable LPF c^T β is obtained simply by substituting any least squares solution β̂ for β, thus getting c^T β̂.

Case I: X_{n×p} is of full column rank

- Let us now discuss the case where we assume Rank(X_{n×p}) = p. Then immediately Rank(X^T X) = p, which means X^T X is non-singular. Hence the normal equations have the unique solution

      β̂ = (X^T X)^{−1} X^T y,

  which is called the least squares estimate of β.

Example: Simple Linear Regression

- Consider a simple linear regression model

      y = a + bx + ε.

- For paired data {(x_i, y_i), i = 1, 2, ..., n} we have the linear model

      y = Xβ + ε

  where y = (y1, y2, ..., yn)^T, β = (a, b)^T and

      X = [ 1  x1
            1  x2
            ...
            1  xn ].

- We note that

      X^T X = [ n       Σ x_i
                Σ x_i   Σ x_i² ]

  and hence

      (X^T X)^{−1} = 1/( n Σ x_i² − (Σ x_i)² ) [  Σ x_i²   −Σ x_i
                                                 −Σ x_i     n     ].

- Further,

      X^T y = [ Σ y_i
                Σ x_i y_i ].

- Simplifying, we get β̂ = (X^T X)^{−1} X^T y, that is, b̂ = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² and â = ȳ − b̂ x̄.

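As a small illustration (hypothetical data, not from the slides), the normal equations can be solved directly in numpy and compared with the closed form above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    X = np.column_stack([np.ones_like(x), x])       # design matrix with rows (1, x_i)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # unique solution: X has full column rank
    print(beta_hat)                                 # [a_hat, b_hat]

    # Same answer from the closed-form expressions.
    b_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a_hat = y.mean() - b_hat * x.mean()
    print(a_hat, b_hat)
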


Example: Model with single factor covariate

- Recall the linear model with one factor covariate having k levels A1, A2, ..., Ak.

- Suppose we work with the parametrization

      y = α + β1 x1 + ... + β_{k−1} x_{k−1} + ε

  where x_i is the dummy variable for the i-th level of the factor.

- Recall that, for simplicity, we defined y_ij to be the j-th observation receiving the i-th level A_i, j = 1, 2, ..., n_i and i = 1, 2, ..., k.

- We also discussed that in this model the design matrix X is of full column rank, which means every parametric function is estimable.

- Suppose we want to find the least squares estimate of α, which we now know will be the BLUE.

Example (Contd.)

- We could simply apply the formula θ̂ = (X^T X)^{−1} X^T y, but computing (X^T X)^{−1} would be a tedious job here.

- Instead we can directly attempt to minimize

      S = Σ_{i=1}^{n} (y_i − α − β1 x_{1i} − ... − β_{k−1} x_{(k−1)i})²

  with respect to the parameters.

- Now we note that, with our y_ij notation, S takes the form S = Σ_{i=1}^{k} S_i, where

      S_i = Σ_{j=1}^{n_i} (y_ij − α − β_i)²   for the observations in the i-th group, i ≠ k,

  and

      S_k = Σ_{j=1}^{n_k} (y_kj − α)²   for the observations in the k-th group.

Example (Contd.)

- Hence ∂S/∂α = Σ_{i=1}^{k} ∂S_i/∂α.

- Setting ∂S/∂α = 0 we get

      Σ_{i=1}^{k−1} Σ_{j=1}^{n_i} (y_ij − α − β_i) + Σ_{j=1}^{n_k} (y_kj − α) = 0.

- Now, similarly setting ∂S/∂β_i = 0 for each i = 1, 2, ..., k − 1, we get

      Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0.

- Subtracting these k − 1 equations from the first one, we get

      Σ_{j=1}^{n_k} (y_kj − α) = 0,

  which implies α̂ = ȳ_{k0}, the mean of the observations in the k-th group.

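This can be checked numerically; the sketch below (hypothetical group sizes and data, purely illustrative) fits the full-column-rank dummy-variable model and confirms that α̂ is the mean of the k-th group:

    import numpy as np

    rng = np.random.default_rng(1)
    k = 3
    sizes = [4, 5, 6]                                   # hypothetical group sizes n_1, ..., n_k
    groups = np.repeat(np.arange(k), sizes)
    y = np.concatenate([rng.normal(loc=m, size=n) for m, n in zip([1.0, 2.0, 3.0], sizes)])

    # Intercept plus dummies for levels 1, ..., k-1: full column rank.
    X = np.column_stack([np.ones(groups.size)] +
                        [(groups == i).astype(float) for i in range(k - 1)])
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.isclose(theta_hat[0], y[groups == k - 1].mean()))              # alpha_hat = k-th group mean
    print(np.isclose(theta_hat[0] + theta_hat[1], y[groups == 0].mean()))   # alpha_hat + beta1_hat = 1st group mean
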
Case II: X is not of full column rank

- Let us now consider the case where the design matrix X is not of full column rank.

- Then X^T X is singular.

- There are generally two ways to deal with this situation.

- Both ways have their own merits and demerits, and we shall discuss both of them soon.

Using Generalized Inverse

- The first way to deal with a singular X^T X is to use a generalized inverse.

- Recall that a generalized inverse of a matrix A is another matrix, denoted by A⁻, such that

      A A⁻ A = A.

- Thus clearly we can use a g-inverse of X^T X to get a solution of the normal equations as

      β̂ = (X^T X)⁻ X^T y.

Example (singular X^T X)

- Recall our earlier example of a linear model having one factor covariate with k levels A1, A2, ..., Ak. Now let us consider the alternative parametrization

      y = α + Σ_{i=1}^{k} β_i x_i + ε.

- Recall that if we arrange the y's suitably, then the design matrix has the block form

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       B_k   ].

Example (Contd.)

- Clearly X is not of full column rank here, and as such X^T X is singular. Let us check how the normal equations turn out in this case.

- We note that

      X^T X = [ n    n1   n2   ...   nk
                n1   n1   0    ...   0
                n2   0    n2   ...   0
                ...
                nk   0    0    ...   nk ].

Example (Contd.)

- Further,

      X^T y = [ Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij
                Σ_{j=1}^{n_1} y_1j
                Σ_{j=1}^{n_2} y_2j
                ...
                Σ_{j=1}^{n_k} y_kj              ].

Example (Contd.)

- Thus the normal equations are

      [ n    n1   n2   ...   nk ] [ α  ]     [ Σ_{i} Σ_{j} y_ij ]
      [ n1   n1   0    ...   0  ] [ β1 ]     [ Σ_{j} y_1j       ]
      [ n2   0    n2   ...   0  ] [ β2 ]  =  [ Σ_{j} y_2j       ]
      [ ...                     ] [ .. ]     [ ...              ]
      [ nk   0    0    ...   nk ] [ βk ]     [ Σ_{j} y_kj       ].

Example (Contd.)

- Clearly there can be many solutions to this system, one for each choice of (X^T X)⁻.

- One immediate solution can be found by looking at the system: choose α = 0, so that we can ignore the first row and column of X^T X to get the submatrix

      diag(n1, n2, ..., nk),

  whose inverse does exist. Then we get β̂_i = (1/n_i) Σ_{j=1}^{n_i} y_ij = ȳ_{i0}.

- Thus α̂ = 0 and β̂_i = ȳ_{i0} for all i form one solution of the normal equations.

Example (Contd.)

- The corresponding g-inverse we are taking here is

      (X^T X)⁻ = [ 0   0
                   0   D ]

  where D = diag(1/n1, 1/n2, ..., 1/nk).

- Obviously there can be other solutions corresponding to other g-inverses: can you find one such?

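For instance, a numpy sketch (hypothetical data, purely illustrative) comparing the solution from the block g-inverse above with the one from the Moore-Penrose inverse: both solve the normal equations, and both give the same fitted values Xβ̂:

    import numpy as np

    rng = np.random.default_rng(2)
    k = 3
    n_i = np.array([4, 5, 6])
    groups = np.repeat(np.arange(k), n_i)
    y = rng.normal(size=groups.size)
    n = groups.size

    # Intercept plus a dummy for every level: rank k, not k + 1.
    X = np.column_stack([np.ones(n)] + [(groups == i).astype(float) for i in range(k)])
    XtX, Xty = X.T @ X, X.T @ y

    G = np.zeros((k + 1, k + 1))              # the block g-inverse from the slide
    G[1:, 1:] = np.diag(1.0 / n_i)
    beta1 = G @ Xty                           # alpha_hat = 0, beta_i_hat = group means
    beta2 = np.linalg.pinv(XtX) @ Xty         # another solution, via the Moore-Penrose inverse

    print(np.allclose(XtX @ beta1, Xty), np.allclose(XtX @ beta2, Xty))   # both solve the normal equations
    print(np.allclose(X @ beta1, X @ beta2))                              # the fitted values coincide
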
Alternative Way

- Another way to deal with a singular X^T X is to impose identifiability constraints.

- Note that in this case we could have worked with β̂ = (X^T X)⁻ X^T y, so that c^T β̂ is the LSE (and also the BLUE) of any estimable parametric function c^T β.

- However, in that case β̂ is only a solution of the normal equations: it need not be the LSE (or BLUE) of β, and in fact some or all of the model parameters may not be estimable in this case. This causes problems in some linear models where estimating the model parameters is an important issue.

Identifiability Constraints

- Let Rank(X_{n×p}) = r < p, which means X has r linearly independent rows (equivalently, r linearly independent columns).

- Then we consider a matrix H with rows from the orthocomplement of the row space R(X), such that

      Rank( [X; H] ) = p.

- It is enough to take p − r linearly independent vectors from R(X)^⊥.

- We then impose the constraint

      Hβ = 0.

Identifiability Constraints (Contd.)

- Then we note that, for any β satisfying Hβ = 0 and the normal equations,

      [X; H]^T [X; H] β = ( X^T  H^T ) [X; H] β = (X^T X + H^T H) β = (X^T X) β = X^T y.

- Hence we get a new system of linear equations

      (X^T X + H^T H) β = X^T y.

Identifiability Constraints (Contd.)

- Further,

      Rank(X^T X + H^T H) = Rank( [X; H]^T [X; H] ) = Rank( [X; H] ) = p.

- And X^T X + H^T H is a p × p matrix.

- This means

      (X^T X + H^T H) β = X^T y

  has a unique solution.

- The unique solution is given by

      β̂ = (X^T X + H^T H)^{−1} X^T y.

- The problem with this approach is that β̂ depends on the choice of the linear constraints H, and there can be many such constraints, which means β̂ is not unique.

Example

- Recall our earlier example of a linear model having one factor covariate with k levels A1, A2, ..., Ak.

- We consider the alternative parametrization

      y = α + Σ_{i=1}^{k} β_i x_i + ε.

- Recall that if we arrange the y's suitably, then the design matrix has the block form

      X = [ 1_{n1}       B1
            1_{n2}       B2
            ...
            1_{n_{k−1}}  B_{k−1}
            1_{nk}       B_k   ].

Example (Contd.)

- Clearly X is not of full column rank here, and as such X^T X is singular. We have already discussed one least squares solution in this case.

- Here we can instead impose constraints to make the parameters identifiable.

- For example, one can impose Σ n_i β_i = 0, or Σ β_i = 0, or in general Σ w_i β_i = 0.

- The final least squares estimates will change depending on which constraint we use.

Working with the constraint Σ n_i β_i = 0

- Let us denote n_(1)^T = (n1, n2, ..., nk), so that n_(1)^T 1_k = 1_k^T n_(1) = Σ n_i = n.

- When the constraint Σ n_i β_i = 0 is expressed in the form Hβ = 0, the choice of H becomes (0, n1, n2, ..., nk).

- As such,

      H^T H = [ 0   0
                0   n_(1) n_(1)^T ].

- Recall that

      X^T X = [ n      n_(1)^T
                n_(1)  D       ]

  where D = diag(n1, n2, ..., nk).

- This implies

      X^T X + H^T H = [ n      n_(1)^T
                        n_(1)  D + n_(1) n_(1)^T ].

- We need to find the inverse of this matrix.

- Recall two facts from matrix algebra:

  - Assuming S to be non-singular,

        [ P  Q ]^{−1}   [ F^{−1}             −F^{−1} Q S^{−1}                  ]
        [ R  S ]      = [ −S^{−1} R F^{−1}    S^{−1} + S^{−1} R F^{−1} Q S^{−1} ]

    where F = P − Q S^{−1} R.

  - If A is non-singular,

        (A + u v^T)^{−1} = A^{−1} − ( A^{−1} u v^T A^{−1} ) / ( 1 + v^T A^{−1} u ).

- Applying the second fact, let us find the inverse of D + n_(1) n_(1)^T, because we know that D is non-singular.

- Now

      (D + n_(1) n_(1)^T)^{−1} = D^{−1} − ( D^{−1} n_(1) n_(1)^T D^{−1} ) / ( 1 + n_(1)^T D^{−1} n_(1) ).

- Now n_(1)^T D^{−1} n_(1) = Σ_{i=1}^{k} n_i² (1/n_i) = Σ_{i=1}^{k} n_i = n.

- Further,

      D^{−1} n_(1) = diag(1/n1, 1/n2, ..., 1/nk) (n1, n2, ..., nk)^T = 1_k.

- Hence D^{−1} n_(1) n_(1)^T D^{−1} = 1_k 1_k^T.

- Thus

      (D + n_(1) n_(1)^T)^{−1} = D^{−1} − (1/(n+1)) 1_k 1_k^T.

- This is S^{−1} in the general form of the first fact.

- As such, Q S^{−1} R is

      n_(1)^T (D + n_(1) n_(1)^T)^{−1} n_(1)
        = n_(1)^T ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1)
        = n_(1)^T D^{−1} n_(1) − n_(1)^T 1_k 1_k^T n_(1) / (n+1)
        = n − n²/(n+1) = n/(n+1).

- Thus F = P − Q S^{−1} R = n − n/(n+1) = n²/(n+1), and F^{−1} = (n+1)/n².

- Next we find F^{−1} Q S^{−1} as

      ((n+1)/n²) n_(1)^T (D + n_(1) n_(1)^T)^{−1}
        = ((n+1)/n²) n_(1)^T ( D^{−1} − 1_k 1_k^T/(n+1) )
        = ((n+1)/n²) 1_k^T − n_(1)^T 1_k 1_k^T / n²
        = ((n+1)/n²) 1_k^T − (1/n²) (n, n, ..., n)
        = (1/n²) 1_k^T.

- Further, R F^{−1} Q S^{−1} is

      n_(1) (1/n²) 1_k^T = (1/n²) n_(1) 1_k^T.

- This implies that S^{−1} R F^{−1} Q S^{−1} is

      (D + n_(1) n_(1)^T)^{−1} (1/n²) n_(1) 1_k^T
        = (1/n²) ( D^{−1} − 1_k 1_k^T/(n+1) ) n_(1) 1_k^T
        = (1/n²) ( 1_k 1_k^T − 1_k (1_k^T n_(1)) 1_k^T / (n+1) )
        = (1/n²) ( 1_k 1_k^T − (n/(n+1)) 1_k 1_k^T )
        = 1_k 1_k^T / ( n²(n+1) ).

- Hence S^{−1} + S^{−1} R F^{−1} Q S^{−1} is

      D^{−1} − (1/(n+1)) 1_k 1_k^T + (1/(n²(n+1))) 1_k 1_k^T
        = D^{−1} − ((n−1)/n²) 1_k 1_k^T

        = [ 1/n1 − (n−1)/n²    −(n−1)/n²           ...   −(n−1)/n²
            −(n−1)/n²           1/n2 − (n−1)/n²    ...   −(n−1)/n²
            ...
            −(n−1)/n²           −(n−1)/n²          ...    1/nk − (n−1)/n² ].

- Finally, S^{−1} R F^{−1} is

      ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1) ((n+1)/n²)
        = ((n+1)/n²) ( 1_k − (n/(n+1)) 1_k )
        = (1/n²) 1_k.

- Thus we get

      [ α̂  ]   [ (n+1)/n²   −1/n²              −1/n²              ...   −1/n²             ] [ Σ_{i} Σ_{j} y_ij ]
      [ β̂1 ]   [ −1/n²       1/n1 − (n−1)/n²   −(n−1)/n²          ...   −(n−1)/n²         ] [ Σ_{j} y_1j       ]
      [ β̂2 ] = [ −1/n²      −(n−1)/n²           1/n2 − (n−1)/n²   ...   −(n−1)/n²         ] [ Σ_{j} y_2j       ]
      [ .. ]   [ ...                                                                       ] [ ...              ]
      [ β̂k ]   [ −1/n²      −(n−1)/n²          −(n−1)/n²          ...    1/nk − (n−1)/n²  ] [ Σ_{j} y_kj       ].

- Thus we get

      α̂ = ((n+1)/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij
         = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij = ȳ_{00},

  which is the mean of all the observations, that is, the grand mean.

- Further, for each i,

      β̂_i = − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − ((n−1)/n²) Σ_{ℓ≠i} Σ_{j=1}^{n_ℓ} y_ℓj + ( 1/n_i − (n−1)/n² ) Σ_{j=1}^{n_i} y_ij
           = − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − ((n−1)/n²) Σ_{ℓ=1}^{k} Σ_{j=1}^{n_ℓ} y_ℓj + (1/n_i) Σ_{j=1}^{n_i} y_ij
           = ȳ_{i0} − ȳ_{00},

  where ȳ_{i0} is the mean of the observations of the i-th group.
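
A short numpy check of this result (hypothetical data, purely illustrative): solving (X^T X + H^T H)β = X^T y with H built from the constraint Σ n_i β_i = 0 reproduces α̂ = ȳ_00 and β̂_i = ȳ_i0 − ȳ_00:

    import numpy as np

    rng = np.random.default_rng(3)
    k = 3
    n_i = np.array([4, 5, 6])
    groups = np.repeat(np.arange(k), n_i)
    y = rng.normal(size=groups.size)
    n = groups.size

    X = np.column_stack([np.ones(n)] + [(groups == i).astype(float) for i in range(k)])
    H = np.concatenate([[0.0], n_i.astype(float)])[None, :]    # constraint: sum_i n_i beta_i = 0

    beta_hat = np.linalg.solve(X.T @ X + H.T @ H, X.T @ y)
    group_means = np.array([y[groups == i].mean() for i in range(k)])

    print(np.isclose(beta_hat[0], y.mean()))                   # alpha_hat = grand mean
    print(np.allclose(beta_hat[1:], group_means - y.mean()))   # beta_i_hat = group mean - grand mean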


Alternative Method

- We want to minimize

      S = Σ_{i=1}^{n} (y_i − α − β1 x_{1i} − β2 x_{2i} − ... − βk x_{ki})²

  with respect to α, β1, ..., βk subject to the condition Σ n_i β_i = 0.

- We understand that

      S = Σ_{i=1}^{k} S_i,   where   S_i = Σ_{j=1}^{n_i} (y_ij − α − β_i)².

- Thus, setting ∂S/∂α = Σ_i ∂S_i/∂α = 0, we get

      Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0,

  which implies α̂ = ȳ_{00}, since Σ n_i β_i = 0.

- Again, from ∂S/∂β_i = ∂S_i/∂β_i = 0, we get

      Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0,

  which implies β̂_i = ȳ_{i0} − ȳ_{00}.

What about the fitted values and residuals?

- Here the fitted values of y will be

      ŷ = X β̂ = X (X^T X)⁻ X^T y = H y (say).

- But then what are E(ŷ) and Var(ŷ)?

- We can write

      E(ŷ) = H E(y) = H X β = X β

  and

      Var(ŷ) = σ² H = σ² X (X^T X)⁻ X^T.

Fitted values and residuals (Contd.)

- We note that there may exist many g-inverses of X^T X, but for any such g-inverse H = X (X^T X)⁻ X^T is the same.

- Thus, combining the cases of non-singular and singular X^T X, we can write the unique fitted values as

      ŷ = H y,

  where H = X (X^T X)⁻ X^T, E(ŷ) = X β and Var(ŷ) = σ² H.

- Further, the residuals are given by

      e = y − ŷ = (I_n − H) y.

Estimable functions

- One question is: what do we mean when we say that β is estimable?

- Recall that we learned what is meant by estimability of a parametric function c^T β.

- The fact is that β being estimable means every component β_i is estimable.

- For each β_i, expressed as a linear parametric function c^T β, the choice of c is the i-th unit vector u_i.

- Thus every such β_i will be estimable iff all the p unit vectors u_i belong to the row space of X, which amounts to saying that X is of full column rank.

- Hence all the parameters in a linear model are estimable iff the design matrix is of full column rank.

Estimability (Contd.)

- This discussion also extends the idea of estimability, because now we can talk about linear parametric functions Cβ, where C_{k×p} is a matrix.

- This is the same as considering k linear parametric functions at a single time.

- Similarly, we can say that Cβ is estimable iff

      Rank(X) = Rank( [X; C] ).

- This explains why, in the case of a singular X^T X, we talk about the estimation of Xβ.

- Since Xβ is estimable, the least squares estimate ŷ = X β̂ is the BLUE of Xβ.

Estimation of σ²

- In linear models the error variance σ² is a nuisance parameter, but we shall still estimate it. An unbiased estimator of σ² can be constructed from the residuals.

- We note that e = (I − H) y, which implies e^T e = y^T (I − H) y. This quantity is called the residual sum of squares (RSS) or the sum of squares due to error (SSE).

- Recall that if a random vector X has mean µ and variance Σ, then

      E(X^T A X) = µ^T A µ + tr(A Σ).

- Then

      E(e^T e) = E( y^T (I − H) y ) = (Xβ)^T (I − H)(Xβ) + tr( σ² (I − H) ).

Estimation (Contd.)

I Now (X β)T (I − H)(X β) = 0 and

Tr (I − H) = Tr (I ) − Tr (H)

= n − Rank(H) = n − r
if Rank(X ) = r .
Estimation (Contd.)
I Now (X β)T (I − H)(X β) = 0 and
Tr (I − H) = Tr (I ) − Tr (H) = n − Rank(H) = n − r
if Rank(X ) = r .
I Thus e T e/(n − r ) is an unbiased estimate of σ 2 . We call it the
least square estimate of σ 2 : that does not mean we apply least
squares to get the estimator; rather, the estimator is obtained
from the least square residuals. We denote it as
σ̂ 2 LSE = SSE /(n − r ).
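To make this concrete, here is a minimal NumPy sketch (the data and all variable names are made up for illustration) that fits a small model by least squares and computes SSE and the unbiased estimate SSE /(n − r ):

```python
import numpy as np

# Hypothetical design: n = 6 observations, p = 2 parameters
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0])

n, p = X.shape
r = np.linalg.matrix_rank(X)              # here r = p = 2 (full column rank)

# Least squares fit: beta_hat solves the normal equations X'X beta = X'y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ beta_hat                       # residuals e = (I - H) y
SSE = e @ e                                # residual sum of squares
sigma2_hat = SSE / (n - r)                 # unbiased "least square" estimate of sigma^2
print(beta_hat, SSE, sigma2_hat)
```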
Maximum Likelihood Estimation
I For likelihood estimation it is mandatory that we have some
distributional assumption regarding ε. Assume ε ∼ Nn (0, σ 2 In ).
I Here β and σ 2 are unknown parameters and the likelihood
function is given by
L(β, σ 2 ) = (2πσ 2 )−n/2 exp{ −(y − X β)T (y − X β)/(2σ 2 ) }
and hence
` = ln L(β, σ 2 ) = const − (n/2) ln σ 2 − (1/(2σ 2 ))(y − X β)T (y − X β).
MLE (Contd.)
I Setting ∂`/∂β = 0 we get
(1/σ 2 ) X T (y − X β) = 0.
I Now since σ 2 > 0, we get
X T X β = X T y
which implies β̂ = β̂LSE .
I Hence
L(β, σ 2 ) ≤ L(β̂, σ 2 ) for all σ 2 > 0
with equality iff β = β̂.
MLE (Contd.)
I Let us now maximize L(β̂, σ 2 ) with respect to σ 2 .
I Setting ∂`/∂σ 2 = 0 we get
(1/σ 2 ) [ (y − X β̂)T (y − X β̂)/σ 2 − n ] = 0.
I From σ 2 > 0 we get
σ̂ 2 = (1/n)(y − X β̂)T (y − X β̂).
MLE (Contd.)
I Thus we get for all β and σ 2 > 0,
L(β̂, σ̂ 2 ) ≥ L(β̂, σ 2 ) ≥ L(β, σ 2 ).
I Hence finally we get
β̂MLE = β̂LSE
and
σ̂ 2 MLE = (1/n)(y − X β̂)T (y − X β̂) = SSE /n.
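A short NumPy sketch (with hypothetical data) contrasting the MLE SSE /n with the unbiased estimate SSE /(n − r ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta_hat is both the LSE and the MLE
SSE = np.sum((y - X @ beta_hat) ** 2)
r = np.linalg.matrix_rank(X)

sigma2_mle = SSE / n          # maximum likelihood estimate (biased downwards)
sigma2_lse = SSE / (n - r)    # unbiased least-square-residual estimate
print(sigma2_mle, sigma2_lse)
```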
General linear hypothesis
I Consider a linear model
y = X β + ε.
I A general linear hypothesis is of the form
H0 : Lβ = c
where Lq×p is of rank q.
I For testing such a hypothesis we first need a distributional
assumption ε ∼ Nn (0, σ 2 I ), which means y ∼ N(X β, σ 2 I ).
I Any linear hypothesis H0 : Lβ = c is called testable if Lβ is
estimable, that is R(L) ⊂ R(X ).
Choices of L
I Suppose we wish to test H0 : β = β0 . Then we take L = Ip and c = β0 .
I Partition β as β = (β (1) , β (2) ), where β (1) is p1 × 1, β (2) is p2 × 1 and
p1 + p2 = p. Suppose we wish to test H0 : β (2) = 0. Then we choose
Lp2 ×p = ( 0p2 ×p1 Ip2 ).
I Often we need to test H0 : β1 = β2 = ... = βp . The hypothesis can
be written as
H0 : β1 − β2 = 0, β1 − β3 = 0, . . . , β1 − βp = 0.
Here the choice of L is the (p − 1) × p matrix
1 −1  0 · · ·  0
1  0 −1 · · ·  0
.   .   .        .
1  0  0 · · · −1
I For testing a linear combination of β, H0 : `T β = m, we may choose
L = `T , and c = m is a scalar.
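As an illustration, a small NumPy sketch (p and the hypotheses are hypothetical, chosen only to show the construction) that builds the L matrices described above:

```python
import numpy as np

p = 5  # hypothetical number of parameters

# H0: beta = beta0                        ->  L = I_p
L_full = np.eye(p)

# H0: beta^(2) = 0 (last p2 components)   ->  L = [0  I_{p2}]
p2 = 2
L_subset = np.hstack([np.zeros((p2, p - p2)), np.eye(p2)])

# H0: beta_1 = beta_2 = ... = beta_p      ->  rows (1, -1, 0, ...), (1, 0, -1, ...), ...
L_equal = np.hstack([np.ones((p - 1, 1)), -np.eye(p - 1)])

print(L_subset)
print(L_equal)
```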
First fundamental theorem
I Suppose we assume ε ∼ Nn (0, σ 2 In ), that is, y ∼ Nn (X β, σ 2 In ).
Let
R02 = SSE = minβ (y − X β)T (y − X β).
Then R02 ∼ σ 2 χ2n−r where r = Rank(X ).
Second fundamental theorem
I Let y ∼ Nn (X β, σ 2 In ) where Rank(Xn×p ) = r and Lq×p be a
matrix of rank q such that R(L) ⊂ R(X ). Further let us define
R02 = SSE = minβ (y − X β)T (y − X β)
and
R12 = SSEH0 = minβ:Lβ=c (y − X β)T (y − X β)
where c is known. Then
I R02 and R12 − R02 are independently distributed.
I R02 ∼ σ 2 χ2n−r and R12 − R02 is distributed as σ 2 times a
(possibly non-central) χ2 with q degrees of freedom.
I If Lβ = c is true then R12 − R02 ∼ σ 2 χ2q and
(R12 − R02 )/q ÷ R02 /(n − r ) ∼ Fq,n−r .
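A minimal numerical sketch of the theorem's ingredients, assuming a full-column-rank X (so (X T X )−1 exists) and using the standard restricted least squares formula for the minimizer under Lβ = c; the data and the hypothesis are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, q = 40, 3, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

L = np.array([[0.0, 1.0, -1.0]])   # hypothetical H0: beta_2 = beta_3
c = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)             # full column rank, so r = p
beta_hat = XtX_inv @ X.T @ y
R0 = np.sum((y - X @ beta_hat) ** 2)         # R0^2 = SSE

# Restricted least squares under L beta = c (standard full-rank formula)
beta_H = beta_hat - XtX_inv @ L.T @ np.linalg.solve(L @ XtX_inv @ L.T, L @ beta_hat - c)
R1 = np.sum((y - X @ beta_H) ** 2)           # R1^2 = SSE under H0

r = p
F = ((R1 - R0) / q) / (R0 / (n - r))
print(F, stats.f.sf(F, q, n - r))            # F statistic and its p-value
```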
Testing H0 : Motivation 1
I We note that SSH0 is the change in SSE due to H0 , or we can interpret this as the sum of squares
due to departure from H0 .
I Hence the value of SSH0 serves as a measure for testing H0 : the larger the value of SSH0 , the
more is the departure from H0 .
I Hence a large value of SSH0 will indicate rejection of H0 : that is, a right tailed test based on
SSH0 will be appropriate for testing H0 .
I But how large?
I Here the value of SSE serves as the benchmark for judging SSH0 .
I Hence ideally the ratio SSH0/SSE should serve as the test statistic. But this ratio does not have
any standard distribution.
I Hence, in order to make a test statistic with a standard distribution we need to consider
F = MSH0/MSE which, according to the previous discussion, will have an F distribution under H0 .
Confusion with the word ANOVA
I The word ANOVA is generally used with two meanings in the
literature.
I This generally adds confusion for learners who have only recently
become acquainted with the term, and that is why, in attempts at
clarification, some even more confusing terms like
“ANOVA in Regression” appear in some illustrations.
I Here we mention at the outset that the word ANOVA will be
used under two meanings throughout our entire journey of
linear models:
I ANOVA as a model.
I ANOVA as a technique.
ANOVA as technique
I ANOVA as a technique starts with our basic understanding
regarding statistical modeling.
I Items in any collection will not all be identical in every aspect;
they always exhibit some variability with respect to any
study variable.
I A statistician’s job is to investigate the cause of the variation.
I The process of identifying the potential causes of variation is
accomplished by constructing models.
I In every statistical model studied, we believe that the causes of
variation can be classified in two categories:
I Assignable Causes
I Chance Causes.
I While we understand that chance causes are uncontrollable
and represent our ignorance or reluctance in a statistical
model, it may seem that they are useless. But the truth is,
these apparently useless quantities provide us a benchmark for
understanding how significant the assignable causes are; that
is, the errors themselves are useless but they help us to
understand the importance of the other causes in the model.
ANOVA as technique
I When we have some potential factors or covariates affecting
the response variable, ANOVA splits the total sum of squares
into component sums of squares which reflect the amount of
variation in the response due to each individual factor or
covariate.
I This technique of splitting the total sum of squares is called
analysis of variance.
I The remaining idea is that we can then compare each individual
sum of squares to the sum of squares due to the chance causes
to understand how serious the effect of that particular
covariate is.
I The amount of variation due to chance causes represents the
allowable extent of variation, and any variation larger than that
allowable value will force us to think that the cause of the
variation is serious, or in other words, that the factor or covariate
has a serious impact on the response.
ANOVA as technique
I A natural question is how large that component sum of squares
should be to make us think the corresponding effect is
serious. We could simply compare the ratio with 1, and if
it comes out to be more than 1 we may take that covariate
effect to be significant.
I But we are working with samples and we need to allow for the
sampling variability of that ratio.
I Hence to obtain a threshold value we need to study the
sampling distribution of the ratio.
I If we assume normality of the response values, then generally
the sum of squares terms follow a χ2 distribution.
I Hence the ratio, after adjustments for degrees of freedom, will
have an F distribution with appropriate degrees of freedom.
I Thus ANOVA as a technique compares the variation in the
form of an F test.
Motivation 2
I It should be intuitive that SSE represents variation due to random causes
which we consider to be allowable.
I This means if there is variation in the response variable y up to the
amount of SSE we consider that variation to be beyond our control and
hence allowable.
I The ANOVA approach tests for the significance of any hypothesis by
comparing the sums of squares.
I Thus to test whether H0 is significant we need to compare SSH0 with
SSE .
I This is comparing the variation in y caused due to H0 with our
benchmark SSE .
I If SSH0 comes out significantly larger than SSE , we conclude that the
variation caused due to H0 is significant and non-ignorable.
I Hence we shall compute the ratio (SSH0 /df1 ) ÷ (SSE /df2 ) and this
should be our test statistic for testing H0 .
Motivation 3
I For testing H0 : Lβ = c, a natural statistic is Lβ̂ − c.
I H0 will be rejected if Lβ̂ is sufficiently far away from c.
I However, not every element in Lβ̂ should be treated the same,
as they have different precisions.
I A suitable distance measure that incorporates the precision of
each component of Lβ̂ is the quadratic form
(Lβ̂ − c)T [Var (Lβ̂)]−1 (Lβ̂ − c)
where Var (Lβ̂) = LVar (β̂)LT = σ 2 L(X T X )− LT .
I If we estimate σ 2 by its unbiased estimate σ̂ 2 = SSE /(n − r ), we arrive
at
(Lβ̂ − c)T [L(X T X )− LT ]−1 (Lβ̂ − c) / σ̂ 2 .
I Our test statistic will be a constant times this quadratic
measure.
F-test for general linear hypothesis
I Consider a linear model
yn×1 = Xn×p βp×1 + εn×1
where ε ∼ Nn (0, σ 2 In ) and Rank(X ) = r .
I Consider testing a hypothesis H0 : Lβ = c where Lq×p is of rank q and c
is known.
I The hypothesis H0 is called testable iff R(L) ⊂ R(X ).
I A test statistic for testing H0 is given by
F = (R12 − R02 )/q ÷ R02 /(n − r )
which under H0 has the Fq,n−r distribution.
I Here
R02 = minβ (y − X β)T (y − X β)
and
R12 = minβ:Lβ=c (y − X β)T (y − X β).
I In our new notation the test statistic can be written as
F = (SSH0 /q) ÷ (SSE /(n − r )) = MSH0 /MSE ∼ Fq,n−r under H0 .
I The test statistic can also be expressed in an alternative form
as
F = (Lβ̂ − c)T [L(X T X )− LT ]−1 (Lβ̂ − c) / (qσ̂ 2 )
or sometimes even in a more convenient form as
F = (Lβ̂ − c)T [ Var (Lβ̂)/σ 2 ]−1 (Lβ̂ − c) / (qσ̂ 2 ).
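A sketch of the quadratic-form version of the F statistic, assuming a full-column-rank design so that (X T X )−1 can be used in place of a generalized inverse; the data, L and c are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # full column rank, r = p
y = X @ np.array([2.0, 1.0, 1.0]) + rng.normal(size=n)

L = np.array([[0.0, 1.0, -1.0]])    # hypothetical H0: beta_2 = beta_3
c = np.array([0.0])
q, r = L.shape[0], p

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - r)        # MSE

diff = L @ beta_hat - c
F = diff @ np.linalg.solve(L @ XtX_inv @ L.T, diff) / (q * sigma2_hat)
print(F, stats.f.sf(F, q, n - r))
```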
Testing a linear combination
I One particular choice of L needs special mention.
I Suppose we want to test any linear combination of β, that is,
H0 : `T β = c.
I We shall assume ` ∈ R(X ), so that H0 is testable.
I Then what we find is that the test statistic simplifies to
F = (`T β̂ − c)2 / { [Var (`T β̂)/σ 2 ] σ̂ 2 } ∼ F1,n−r under H0 .
I Since F1,n−r = t2n−r , we can also perform a t-test using the
test statistic
t = |`T β̂ − c| / { σ̂ [Var (`T β̂)/σ 2 ]1/2 } ∼ tn−r under H0 .
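A small sketch of the corresponding t-test for a single linear combination (hypothetical ` and c, full-rank design assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=n)

ell = np.array([0.0, 1.0, 1.0])   # hypothetical H0: beta_2 + beta_3 = c
c = 0.5

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = p
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - r))

t = (ell @ beta_hat - c) / (sigma_hat * np.sqrt(ell @ XtX_inv @ ell))
p_value = 2 * stats.t.sf(abs(t), n - r)      # two-sided, matching the F-test
print(t, p_value, t ** 2)                    # t**2 equals the F statistic
```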
Example
I Consider the linear model
y1 = α1 + ε1
y2 = 2α1 − α2 + ε2
y3 = α1 + 2α2 + ε3
where ε ∼ N3 (0, σ 2 I3 ).
I Suppose we want to test H0 : α1 = α2 .
I First we note that the design matrix is
      1  0
X =   2 −1
      1  2
which is of rank 2.
I Also the hypothesis can be written as
H0 : Lβ = 0
where L = (1, −1) and β = (α1 , α2 )T , with Rank(L) = q = 1.
I We note that L ∈ R(X ) (because X is of full column rank)
which means H0 is testable.
I Now
X T X = diag(6, 5)
and
X T y = (y1 + 2y2 + y3 , −y2 + 2y3 )T .
I Thus
β̂ = (α̂1 , α̂2 )T = (X T X )−1 X T y , that is,
α̂1 = (y1 + 2y2 + y3 )/6 and α̂2 = (2y3 − y2 )/5.
I Then
SSE = (y − X β̂)T (y − X β̂) = y T y − β̂ T X T X β̂
= Σ yi2 − 6α̂12 − 5α̂22 .
I Now we demonstrate two methods of finding the F-statistic.
Method I
I We note that Lβ̂ = α̂1 − α̂2 .
I Also
L(X T X )−1 LT = (1, −1) diag(1/6, 1/5) (1, −1)T = 1/6 + 1/5 = 11/30.
I Thus the F -statistic is given by
F = (α̂1 − α̂2 )2 / ( (11/30) σ̂ 2 ) = (α̂1 − α̂2 )2 / ( (11/30) SSE ) ∼ F1,1 ,
since here σ̂ 2 = SSE /(n − r ) = SSE with n = 3 and r = 2.
Method II
I We shall find SSEH0 , for which we shall require the least square estimate
β̂H under the restriction α1 = α2 .
I Instead of applying the general formula we can directly minimize εT ε
under the restriction α1 = α2 = α (say).
I Thus setting ∂/∂α [ (y1 − α)2 + (y2 − α)2 + (y3 − 3α)2 ] = 0 we get
α̂H = (y1 + y2 + 3y3 )/11.
I Hence
SSEH0 = (y1 − α̂H )2 + (y2 − α̂H )2 + (y3 − 3α̂H )2 .
I The F statistic is given by
F = (SSEH0 − SSE )/SSE .
I That the two F statistics are algebraically the same is a simple verification
left for homework.
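The agreement can also be checked numerically. A sketch with made-up values of y1 , y2 , y3 , computing the F statistic both ways:

```python
import numpy as np

# Hypothetical observations for the three-equation example above
y = np.array([1.2, 0.7, 2.9])
X = np.array([[1.0, 0.0], [2.0, -1.0], [1.0, 2.0]])
L = np.array([[1.0, -1.0]])

XtX_inv = np.linalg.inv(X.T @ X)              # = diag(1/6, 1/5)
beta_hat = XtX_inv @ X.T @ y
SSE = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = SSE / (3 - 2)                    # n - r = 1

# Method I: quadratic-form version of the F statistic
d = L @ beta_hat
F1 = float(d @ np.linalg.solve(L @ XtX_inv @ L.T, d) / sigma2_hat)

# Method II: refit under alpha1 = alpha2 = alpha and compare SSEs
x_restricted = np.array([1.0, 1.0, 3.0])
alpha_H = (x_restricted @ y) / (x_restricted @ x_restricted)   # = (y1 + y2 + 3 y3)/11
SSE_H0 = np.sum((y - alpha_H * x_restricted) ** 2)
F2 = (SSE_H0 - SSE) / SSE

print(F1, F2)   # the two values coincide
```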
Example: Comparison of means of two populations
I Recall the example where we need to judge the effectiveness of
a treatment (or maybe compare the effectiveness of two
treatments, in which case the control group may be thought of
as getting some treatment) in either a controlled experiment or
an observational study.
I Suppose the data are obtained in the form
Control (Treatment A)    Treatment (Treatment B)
y11                      y21
y12                      y22
...                      ...
y1n1                     y2n2
I One way of formulating this problem is to assume that we have random
samples from two independent populations : one population getting
Treatment A (or the control group) and the other population getting
Treatment B (or the treatment group).
I We want to compare the means of the two populations, that is we want to
test H0 : µ1 = µ2 against possible alternatives.
I Assuming normality and equal variance (that is, homoscedasticity), which
means we are assuming
y1j ∼ N(µ1 , σ 2 ) i.i.d., j = 1, 2, ..., n1 , and y2j ∼ N(µ2 , σ 2 ) i.i.d., j = 1, 2, ..., n2 ,
this reduces to Fisher’s t-test.
I The test statistic is given by
t = (ȳ1 − ȳ2 ) / ( s (1/n1 + 1/n2 )1/2 )
which under H0 has a tn1 +n2 −2 distribution.
I Here s 2 is the pooled variance
s 2 = [ (n1 − 1)s12 + (n2 − 1)s22 ] / (n1 + n2 − 2)
where (ni − 1)si2 = Σj (yij − ȳi )2 .
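A quick sketch of this pooled t-test with made-up samples, checking the hand computation against scipy.stats.ttest_ind:

```python
import numpy as np
from scipy import stats

# Made-up samples for the two groups (purely illustrative)
y1 = np.array([5.1, 4.8, 5.6, 5.0, 4.9])          # control
y2 = np.array([5.9, 6.1, 5.7, 6.3, 5.8, 6.0])     # treatment
n1, n2 = len(y1), len(y2)

s2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)
t = (y1.mean() - y2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
p_two_sided = 2 * stats.t.sf(abs(t), n1 + n2 - 2)

print(t, p_two_sided)
print(stats.ttest_ind(y1, y2, equal_var=True))    # should agree with the manual computation
```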
I Another way to work with this problem is to use a linear model
by introducing dummy variables.
I Then the linear model can be written as
z = α + βx + ε
or more precisely
zi = α + βxi + εi , i = 1, 2, ..., n(= n1 + n2 ),
where E (ε) = 0 and D(ε) = σ 2 In .
I Here xi is a dummy variable taking the value 0 for the control group
and 1 for the treatment group,
zi = y1i for i = 1, 2, ..., n1 , zi = y2(i−n1 ) for i = n1 + 1, ..., n1 + n2 ,
and ε1 , ε2 , ..., εn are the random errors.
I If we compare the two formulations we note that in the case of the
linear model we are assuming
E (y1i ) = α and E (y2i ) = α + β
and hence µ1 = α and µ2 = α + β.
I Thus testing H0 : µ1 = µ2 is equivalent to testing H0′ : β = 0.
I Again if we assume the errors ε to be normal then what we get
is
y1i ∼ N(µ1 = α, σ 2 ) and y2i ∼ N(µ2 = α + β, σ 2 ).
I So from the equivalence of the two setups and the equivalence
of the two hypotheses, there must be some equivalence in the
two test statistics.
I Let us obtain the test statistic for testing the linear hypothesis
H0′ : β = 0 against H1 : β ≠ 0 in the case of the linear model.
I First we note that for the hypothesis H0 : β = 0, when expressed
as the linear hypothesis H0 : Lθ = 0, the choice of L becomes
L = `T = (0, 1).
I Now ` ∈ R(X ) and as such H0 : β = 0 is testable and
Rank(L) = 1.
I Then the test statistic is given by
F = (`T θ̂)T [`T (X T X )−1 `]−1 (`T θ̂) / (qS 2 )
where θ̂ = (X T X )−1 X T y is the least square estimate of θ = (α, β)T .
I Now in this case the design matrix is Xn×2 = [1 x ], whose first
column is a column of 1’s and whose second column x has 0 in its
first n1 positions and 1 in its last n2 positions. This implies
X T X = [ n n2 ; n2 n2 ] (rows separated by semicolons)
and
(X T X )−1 (X T y ) = [ n n2 ; n2 n2 ]−1 (Σy1i + Σy2i , Σy2i )T
= (1/(n1 n2 )) [ n2 −n2 ; −n2 n ] (Σy1i + Σy2i , Σy2i )T .
I This implies
θ̂ = (α̂, β̂)T = (ȳ10 , ȳ20 − ȳ10 )T .
I As such `T θ̂ = β̂ = ȳ20 − ȳ10 .
I Also
`T (X T X )−1 ` = (0, 1) (1/(n1 n2 )) [ n2 −n2 ; −n2 n ] (0, 1)T = n/(n1 n2 )
= (n1 + n2 )/(n1 n2 ) = 1/n1 + 1/n2 .
I Moreover
SSE = (y − X θ̂)T (y − X θ̂) = y T y − θ̂T X T X θ̂
= Σ y1i2 + Σ y2i2 − (ȳ10 , ȳ20 − ȳ10 ) [ n n2 ; n2 n2 ] (ȳ10 , ȳ20 − ȳ10 )T
= Σ y1i2 + Σ y2i2 − [ n(ȳ10 )2 + n2 (ȳ20 − ȳ10 )2 + 2n2 ȳ10 (ȳ20 − ȳ10 ) ]
= Σ y1i2 + Σ y2i2 − [ n1 (ȳ10 )2 + n2 (ȳ20 )2 ]
= Σ (y1i − ȳ10 )2 + Σ (y2i − ȳ20 )2 = (n1 − 1)S12 + (n2 − 1)S22 .
I Hence
σ̂ 2 = MSE = SSE /(n − r ) = SSE /(n − 2)
= [ (n1 − 1)S12 + (n2 − 1)S22 ] / (n1 + n2 − 2) = S 2 .
I Thus the test statistic is
F = (ȳ10 − ȳ20 )2 / [ S 2 (1/n1 + 1/n2 ) ]
which under H0 has the F1,n1 +n2 −2 distribution.
I Hence what we find is the equivalence in test statistics between the two
approaches and their corresponding tests (critical regions).
I We note that the relation between the two test statistics is F = t 2 and
their corresponding level α critical regions are
ℜ1 = {y : |t| > tn1 +n2 −2;α/2 } and ℜ2 = {y : F > F1,n1 +n2 −2;α }.
I Since F1,n1 +n2 −2 = t2n1 +n2 −2 (so that F1,n1 +n2 −2;α = t2n1 +n2 −2;α/2 ), both
critical regions are exactly the same.
I However there is a little difference between the two approaches of comparison.
I Note that the former Fisher’s t-test approach allows the alternative
hypothesis to be one-sided, that is we can choose the alternative to be
H1 : µ1 > µ2 or H1 : µ1 < µ2 .
I But the F test in the linear hypothesis approach allows only the
two-sided alternative H1 : β ≠ 0.
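A numerical sketch of this equivalence with made-up data: the pooled t statistic squared coincides with the regression F statistic for H0 : β = 0, and the corresponding p-values agree:

```python
import numpy as np
from scipy import stats

# Made-up two-group data (illustrative only)
y1 = np.array([4.9, 5.3, 5.1, 4.7, 5.2])
y2 = np.array([5.8, 6.2, 5.9, 6.1])
n1, n2 = len(y1), len(y2)
n = n1 + n2

# Pooled two-sample t statistic
s2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n - 2)
t = (y1.mean() - y2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))

# Dummy-variable regression z = alpha + beta x + error, F test of beta = 0
z = np.concatenate([y1, y2])
x = np.concatenate([np.zeros(n1), np.ones(n2)])
X = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
SSE = np.sum((z - X @ theta_hat) ** 2)
sigma2_hat = SSE / (n - 2)
ell = np.array([0.0, 1.0])
F = (ell @ theta_hat) ** 2 / (sigma2_hat * ell @ np.linalg.inv(X.T @ X) @ ell)

print(t ** 2, F)                                               # identical values: F = t^2
print(stats.f.sf(F, 1, n - 2), 2 * stats.t.sf(abs(t), n - 2))  # identical p-values
```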
