Linear Model Recap 2
Presidency University
February, 2025
Estimation
- We have
  p = Rank(X_{n×p}) ≤ Rank( [X ; c^T]_{(n+1)×p} ) ≤ p
  ⇒ Rank(X) = Rank( [X ; c^T]_{(n+1)×p} ) whatever c^T is, so in this case every linear function c^T β is estimable.
- Recall that w^T y is an error function if
  E(w^T y) = 0 for all β.
Example
- Consider the observations
  µ1 + µ2 + µ3 + ε1 = 5.1
  µ1 + µ2 + µ3 + ε2 = 8.2
  µ1 − µ2 + ε3 = 4.9
  −µ1 + µ2 + ε4 = 3.1
- The design matrix is
  X = [  1   1   1
         1   1   1
         1  −1   0
        −1   1   0 ],
  which has rank 2.
- Here no individual µ_i is estimable, but µ1 + µ2 + µ3 is estimable.
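The rank criterion above is easy to check numerically. Below is a minimal sketch (numpy assumed; is_estimable is a hypothetical helper, not from the notes) verifying that µ1 + µ2 + µ3 is estimable for this design while µ1 alone is not.

```python
import numpy as np

# Design matrix of the example: rows are the coefficient vectors of (mu1, mu2, mu3)
X = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, -1, 0],
              [-1, 1, 0]], dtype=float)

def is_estimable(X, c):
    """c^T beta is estimable iff appending the row c^T does not increase the rank of X."""
    return np.linalg.matrix_rank(np.vstack([X, c])) == np.linalg.matrix_rank(X)

print(is_estimable(X, np.array([1.0, 1.0, 1.0])))  # True:  mu1 + mu2 + mu3 is estimable
print(is_estimable(X, np.array([1.0, 0.0, 0.0])))  # False: mu1 alone is not estimable
```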
Example: Dummy variable regression model
- Consider the linear model with one factor covariate having k levels A1, A2, ..., Ak.
- Let us, for further simplicity, introduce the notation that y_ij denotes the j-th observation in the i-th group, where j = 1, 2, ..., n_i and i = 1, 2, ..., k.
- Then the design matrix X is of the form
  X = [ 1_{n_1}        B_1
        1_{n_2}        B_2
          ⋮              ⋮
        1_{n_{k−1}}    B_{k−1}
        1_{n_k}        0      ]   (k columns in all),

  where the rows are grouped by level: B_i is the n_i × (k−1) block whose i-th column is 1_{n_i} and whose other entries are 0, so that every row in the A_1 block reads (1, 1, 0, ..., 0), every row in the A_2 block reads (1, 0, 1, 0, ..., 0), and every row in the A_k block reads (1, 0, 0, ..., 0).
- Further recall that X is of full column rank.
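As a concrete illustration, the following sketch (numpy assumed; the group sizes are made up and reference_cell_design is a hypothetical helper) builds this reference-cell design matrix and confirms that it has full column rank.

```python
import numpy as np

def reference_cell_design(group_sizes):
    """Intercept column plus dummy columns for groups A_1, ..., A_{k-1} (A_k is the reference)."""
    k = len(group_sizes)
    blocks = []
    for i, n_i in enumerate(group_sizes):
        B_i = np.zeros((n_i, k - 1))
        if i < k - 1:                      # the last group gets all-zero dummies
            B_i[:, i] = 1.0
        blocks.append(np.hstack([np.ones((n_i, 1)), B_i]))
    return np.vstack(blocks)

X = reference_cell_design([3, 2, 4])       # k = 3 groups with n_1 = 3, n_2 = 2, n_3 = 4
print(X)
print(np.linalg.matrix_rank(X) == X.shape[1])   # True: full column rank
```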
Example (Contd.)
- Now for any group other than the k-th group, we can choose the error functions to be
  e_ij = y_{i(j+1)} − y_{i1},  j = 1, 2, ..., n_i − 1.
- These are error functions which are linearly independent but not orthogonal.
- The coefficients ℓ_ij are determined by the restrictions
  Cov( y_k1 + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij , e_ij ) = 0.
- Since Cov(y_kj, y_ij′) = 0 for any i ≠ k and any j and j′, the restriction
  Cov( y_k1 + Σ_{i=1}^{k} Σ_{j=1}^{n_i − 1} ℓ_ij e_ij , e_ij ) = 0
  boils down to
  Cov( y_k1 + Σ_{j=1}^{n_k − 1} ℓ_kj e_kj , e_kj ) = 0.
Example (Contd.)
- Thus we only need to concentrate on the error functions of the k-th group, and we now want them to be mutually orthogonal.
- For the k-th group we therefore take the orthogonalised error functions
  e_kj = (1/√(j(j+1))) [ Σ_{i=1}^{j} y_ki − j y_{k(j+1)} ],  j = 1, 2, ..., n_k − 1.
- Clearly
  Var(e_kj) = (1/(j(j+1))) [ j σ² + j² σ² ] = σ²
  and
  Cov(y_k1, e_kj) = (1/√(j(j+1))) Var(y_k1) = σ² / √(j(j+1)).
- Combining, we get
  ℓ_kj = −1/√(j(j+1)).
Example (Contd.)
- This means the BLUE is
  y_k1 + Σ_{j=1}^{n_k − 1} ℓ_kj e_kj
    = y_k1 − Σ_{j=1}^{n_k − 1} (1/(j(j+1))) [ Σ_{i=1}^{j} y_ki − j y_{k(j+1)} ].
- Collecting coefficients shows that each y_kj receives weight 1/n_k, so this BLUE is simply ȳ_{k0}, the mean of the k-th group.
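A quick numeric sketch (numpy assumed, made-up data) checking that the expression above indeed reduces to the group mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y_k = rng.normal(size=6)            # hypothetical observations in the k-th group
m = len(y_k)

blue = y_k[0]
for j in range(1, m):               # j = 1, ..., n_k - 1
    blue -= (y_k[:j].sum() - j * y_k[j]) / (j * (j + 1))

print(np.isclose(blue, y_k.mean()))  # True: the BLUE equals the group mean
```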
Method of least squares
- We want to minimize
  (y − Xβ)^T (y − Xβ)
  with respect to β.
- Setting
  ∂/∂β (y − Xβ)^T (y − Xβ) = 0
  we get
  −2 X^T y + 2 (X^T X) β = 0.
- This means
  (X^T X) β = X^T y.

Normal equations are always solvable
- The normal equations are
  (X^T X) β = X^T y.
- When X is of full column rank, X^T X is nonsingular and the unique solution is
  β̂ = (X^T X)^{−1} X^T y.
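A minimal sketch (numpy assumed, simulated data) solving the normal equations directly and checking against np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # full column rank design
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solve (X^T X) beta = X^T y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares directly
print(np.allclose(beta_hat, beta_lstsq))            # True
```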
Example: Simple Linear Regression
- Consider the model
  y = a + b x + ε.
- We note that
  X^T X = [ n       Σ x_i
            Σ x_i   Σ x_i² ]
  and hence
  (X^T X)^{−1} = (1 / (n Σ x_i² − (Σ x_i)²)) [  Σ x_i²   −Σ x_i
                                                −Σ x_i     n    ].
- Further
  X^T y = [ Σ y_i
            Σ x_i y_i ].
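A short sketch (numpy assumed, toy data) computing β̂ = (X^T X)^{-1} X^T y and comparing it with the familiar slope and intercept formulas.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.solve(X.T @ X, X.T @ y)

b_formula = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
a_formula = y.mean() - b_formula * x.mean()
print(np.allclose([a_hat, b_hat], [a_formula, b_formula]))   # True
```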
Example (contd.)
- Write S = Σ_{i=1}^{k} S_i, where
  S_i = Σ_{j=1}^{n_i} (y_ij − α − β_i)²  for i = 1, 2, ..., k − 1,
  and
  S_k = Σ_{j=1}^{n_k} (y_kj − α)²  for the observations in the k-th group.
- Hence ∂S/∂α = Σ_{i=1}^{k} ∂S_i/∂α.
- Setting ∂S/∂α = 0 we get
  Σ_{i=1}^{k−1} Σ_{j=1}^{n_i} (y_ij − α − β_i) + Σ_{j=1}^{n_k} (y_kj − α) = 0.
- Setting ∂S/∂β_i = 0 gives Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0 for each i = 1, ..., k − 1; substituting these into the previous equation, we get
  Σ_{j=1}^{n_k} (y_kj − α) = 0,
  which implies
  α̂ = ȳ_{k0}.
Case II: X is not of full column rank
- In this case the normal equations do not have a unique solution. Two standard ways of handling it are (i) using a generalized inverse of X^T X and (ii) imposing identifiability constraints.
- Both ways have their own merits and demerits, and we shall discuss both of them soon.

Using Generalized Inverse
- Recall that A⁻ is called a generalized inverse of A if
  A A⁻ A = A.
- One solution of the normal equations is then
  β̂ = (X^T X)⁻ X^T y.
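A small sketch (numpy assumed, made-up one-way data) for a rank-deficient design, using the Moore–Penrose inverse as one particular choice of (X^T X)⁻.

```python
import numpy as np

# One-way model with intercept and a dummy for every group: X is NOT of full column rank.
groups = np.repeat([0, 1, 2], [3, 2, 4])
n, k = len(groups), 3
X = np.column_stack([np.ones(n), (groups[:, None] == np.arange(k)).astype(float)])

rng = np.random.default_rng(2)
y = rng.normal(loc=groups.astype(float), size=n)

G = np.linalg.pinv(X.T @ X)          # Moore-Penrose inverse as a choice of (X^T X)^-
beta_hat = G @ X.T @ y               # one solution of the normal equations
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))   # True: normal equations are satisfied
```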
Example (singular X^T X)
- We note that
  X^T X = [ n     n_1   n_2   ⋯   n_k
            n_1   n_1   0     ⋯   0
            n_2   0     n_2   ⋯   0
            ⋮     ⋮     ⋮          ⋮
            n_k   0     0     ⋯   n_k ].

Example (Contd.)
- Further
  X^T y = [ Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij
            Σ_{j=1}^{n_1} y_1j
            Σ_{j=1}^{n_2} y_2j
            ⋮
            Σ_{j=1}^{n_k} y_kj ].
- Thus α̂ = 0 and β̂_i = ȳ_{i0} for all i form one set of solutions of the normal equations.
Identifiability Constraints
- We impose identifiability constraints of the form
  Hβ = 0.
- If β satisfies both the normal equations and the constraint Hβ = 0, then H^T H β = 0 and hence
  (X^T X + H^T H) β = (X^T X) β = X^T y.
- So we consider the augmented equations
  (X^T X + H^T H) β = X^T y.
Identifiability Constraints (Contd.)
- Further,
  Rank(X^T X + H^T H) = Rank( [X ; H]^T [X ; H] ) = Rank( [X ; H] ) = p.
- And (X^T X + H^T H) is a p × p matrix.
- This means
  (X^T X + H^T H) β = X^T y
  has a unique solution
  β̂ = (X^T X + H^T H)^{−1} X^T y.
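A sketch (numpy assumed, made-up data) for the one-way model with the constraint Σ n_i β_i = 0 considered next: the augmented matrix X^T X + H^T H is nonsingular, and its solution satisfies both the normal equations and the constraint.

```python
import numpy as np

groups = np.repeat([0, 1, 2], [3, 2, 4])
n, k = len(groups), 3
X = np.column_stack([np.ones(n), (groups[:, None] == np.arange(k)).astype(float)])
rng = np.random.default_rng(3)
y = rng.normal(loc=groups.astype(float), size=n)

n_i = np.bincount(groups).astype(float)
H = np.concatenate([[0.0], n_i]).reshape(1, -1)        # constraint: sum_i n_i * beta_i = 0

A = X.T @ X + H.T @ H
print(np.linalg.matrix_rank(A) == k + 1)               # True: A is nonsingular
beta_hat = np.linalg.solve(A, X.T @ y)
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))        # True: normal equations hold
print(np.allclose(H @ beta_hat, 0.0))                  # True: constraint holds
```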
Working with constraint Σ n_i β_i = 0
- Let us denote n_(1)^T = (n_1, n_2, ..., n_k), so that
  n_(1)^T 1_k = 1_k^T n_(1) = Σ n_i = n.
- The constraint Σ n_i β_i = 0, when expressed in the form Hβ = 0, corresponds to the choice
  H = (0, n_1, n_2, ..., n_k).
- As such
  H^T H = [ 0    0^T
            0    n_(1) n_(1)^T ].
- Recall that
  X^T X = [ n      n_(1)^T
            n_(1)  D       ],
  where D = diag(n_1, n_2, ..., n_k).
- This implies
  X^T X + H^T H = [ n      n_(1)^T
                    n_(1)  D + n_(1) n_(1)^T ].
- We shall use two standard facts. First, for a suitably partitioned nonsingular matrix,
  [ P  Q ; R  S ]^{−1} = [ F^{−1}            −F^{−1} Q S^{−1}
                           −S^{−1} R F^{−1}   S^{−1} + S^{−1} R F^{−1} Q S^{−1} ],
  where F = P − Q S^{−1} R.
- Second, if A is non-singular,
  (A + u v^T)^{−1} = A^{−1} − (A^{−1} u v^T A^{−1}) / (1 + v^T A^{−1} u).
- Applying the second fact, let us find the inverse of D + n_(1) n_(1)^T, because we know that D is non-singular.
- Now
  (D + n_(1) n_(1)^T)^{−1} = D^{−1} − (D^{−1} n_(1) n_(1)^T D^{−1}) / (1 + n_(1)^T D^{−1} n_(1)).
- Now n_(1)^T D^{−1} n_(1) = Σ_{i=1}^{k} n_i² (1/n_i) = Σ_{i=1}^{k} n_i = n.
- Further
  D^{−1} n_(1) = diag(1/n_1, 1/n_2, ..., 1/n_k) (n_1, n_2, ..., n_k)^T = 1_k.
- Hence D^{−1} n_(1) n_(1)^T D^{−1} = 1_k 1_k^T.
- Thus
  (D + n_(1) n_(1)^T)^{−1} = D^{−1} − (1/(n+1)) 1_k 1_k^T.
- As such, Q S^{−1} R is
  n_(1)^T (D + n_(1) n_(1)^T)^{−1} n_(1)
    = n_(1)^T ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1)
    = n_(1)^T D^{−1} n_(1) − (n_(1)^T 1_k)(1_k^T n_(1)) / (n+1)
    = n − n²/(n+1) = n/(n+1).
- Thus F = P − Q S^{−1} R = n − n/(n+1) = n²/(n+1), and F^{−1} = (n+1)/n².
- Next we find F^{−1} Q S^{−1} as
  ((n+1)/n²) n_(1)^T (D + n_(1) n_(1)^T)^{−1}
    = ((n+1)/n²) n_(1)^T ( D^{−1} − (1/(n+1)) 1_k 1_k^T )
    = ((n+1)/n²) 1_k^T − (1/n²) (n, n, ..., n)
    = (1/n²) 1_k^T.
- Further, R F^{−1} Q S^{−1} is
  n_(1) (1/n²) 1_k^T = (1/n²) n_(1) 1_k^T.
- This implies S^{−1} R F^{−1} Q S^{−1} is
  (D + n_(1) n_(1)^T)^{−1} (1/n²) n_(1) 1_k^T
    = (1/n²) ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1) 1_k^T
    = (1/n²) [ 1_k 1_k^T − (1/(n+1)) 1_k (1_k^T n_(1)) 1_k^T ]
    = (1/n²) [ 1_k 1_k^T − (n/(n+1)) 1_k 1_k^T ]
    = (1/(n²(n+1))) 1_k 1_k^T.
- Hence S^{−1} + S^{−1} R F^{−1} Q S^{−1} is
  D^{−1} − (1/(n+1)) 1_k 1_k^T + (1/(n²(n+1))) 1_k 1_k^T
    = D^{−1} − ((n−1)/n²) 1_k 1_k^T
    = [ 1/n_1 − (n−1)/n²    −(n−1)/n²           ⋯   −(n−1)/n²
        −(n−1)/n²           1/n_2 − (n−1)/n²    ⋯   −(n−1)/n²
          ⋮                    ⋮                       ⋮
        −(n−1)/n²           −(n−1)/n²           ⋯   1/n_k − (n−1)/n² ].
- Finally, S^{−1} R F^{−1} is
  ( D^{−1} − (1/(n+1)) 1_k 1_k^T ) n_(1) ((n+1)/n²)
    = ((n+1)/n²) ( 1_k − (n/(n+1)) 1_k )
    = (1/n²) 1_k.
- Thus we get
  ( α̂, β̂_1, β̂_2, ..., β̂_k )^T
    = [ (n+1)/n²        −(1/n²) 1_k^T
        −(1/n²) 1_k      D^{−1} − ((n−1)/n²) 1_k 1_k^T ]  ( Σ_{i}Σ_{j} y_ij,  Σ_{j} y_1j,  Σ_{j} y_2j, ...,  Σ_{j} y_kj )^T.
- Thus we get
  α̂ = ((n+1)/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij − (1/n²) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij
     = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij = ȳ_{00},
  which is the mean of all the observations, that is, the grand mean.
- Similarly
  β̂_i = −(1/n²) Σ_{ℓ=1}^{k} Σ_{j=1}^{n_ℓ} y_ℓj − ((n−1)/n²) Σ_{ℓ≠i} Σ_{j=1}^{n_ℓ} y_ℓj + (1/n_i − (n−1)/n²) Σ_{j=1}^{n_i} y_ij
       = −(1/n²) Σ_{ℓ=1}^{k} Σ_{j=1}^{n_ℓ} y_ℓj − ((n−1)/n²) Σ_{ℓ=1}^{k} Σ_{j=1}^{n_ℓ} y_ℓj + (1/n_i) Σ_{j=1}^{n_i} y_ij
       = ȳ_{i0} − ȳ_{00}.
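A numeric sketch (numpy assumed, made-up data) confirming that the augmented-equations solution under Σ n_i β_i = 0 reproduces α̂ = ȳ_00 and β̂_i = ȳ_i0 − ȳ_00.

```python
import numpy as np

groups = np.repeat([0, 1, 2, 3], [4, 2, 5, 3])           # k = 4 hypothetical groups
n, k = len(groups), 4
X = np.column_stack([np.ones(n), (groups[:, None] == np.arange(k)).astype(float)])
rng = np.random.default_rng(4)
y = rng.normal(loc=2.0 + groups, size=n)

n_i = np.bincount(groups).astype(float)
H = np.concatenate([[0.0], n_i]).reshape(1, -1)          # sum_i n_i beta_i = 0

beta_hat = np.linalg.solve(X.T @ X + H.T @ H, X.T @ y)
group_means = np.array([y[groups == i].mean() for i in range(k)])

print(np.isclose(beta_hat[0], y.mean()))                      # alpha_hat = grand mean
print(np.allclose(beta_hat[1:], group_means - y.mean()))      # beta_i_hat = group mean - grand mean
```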
Alternative Method
- We want to minimize
  S = Σ_{i=1}^{n} (y_i − α − β_1 x_1i − β_2 x_2i − ... − β_k x_ki)²
  with respect to α, β_1, ..., β_k, subject to the condition Σ n_i β_i = 0.
- We understand
  S = Σ_{i=1}^{k} S_i,
  where
  S_i = Σ_{j=1}^{n_i} (y_ij − α − β_i)².
- Thus setting ∂S/∂α = Σ ∂S_i/∂α = 0 we get
  Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0,
  which implies α̂ = ȳ_{00}, as Σ n_i β_i = 0.
- Again, from ∂S/∂β_i = ∂S_i/∂β_i = 0 we get
  Σ_{j=1}^{n_i} (y_ij − α − β_i) = 0,
  which gives β̂_i = ȳ_{i0} − α̂ = ȳ_{i0} − ȳ_{00}.
What about the fitted values and residuals?
- The fitted values are
  ŷ = X β̂ = X (X^T X)⁻ X^T y = Hy (say).
- We can write
  E(ŷ) = H E(y) = H X β = X β
  and
  Var(ŷ) = σ² H = σ² X (X^T X)⁻ X^T.

Fitted values and residuals (Contd.)
- With ŷ = Hy, the residuals are
  e = y − ŷ = (I_n − H) y.
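A short sketch (numpy assumed, made-up data) computing H = X(X^T X)⁻X^T for a rank-deficient design and checking that it is symmetric and idempotent, and that the residuals are orthogonal to the columns of X.

```python
import numpy as np

groups = np.repeat([0, 1, 2], [3, 4, 3])
n, k = len(groups), 3
X = np.column_stack([np.ones(n), (groups[:, None] == np.arange(k)).astype(float)])
rng = np.random.default_rng(5)
y = rng.normal(size=n)

H = X @ np.linalg.pinv(X.T @ X) @ X.T      # hat matrix (invariant to the choice of g-inverse)
e = (np.eye(n) - H) @ y                    # residuals

print(np.allclose(H, H.T))                 # symmetric
print(np.allclose(H @ H, H))               # idempotent
print(np.allclose(X.T @ e, 0))             # residuals orthogonal to the column space of X
```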
Estimable functions
- One question is: what do we mean when we say β is estimable?
Estimation (Contd.)
- Then
  E(e^T e) = E( y^T (I − H) y )
           = (Xβ)^T (I − H)(Xβ) + Tr( σ² (I − H) ).
- Since (I − H)X = 0, the first term vanishes, so E(e^T e) = σ² Tr(I − H).
- Now
  Tr(I − H) = Tr(I) − Tr(H) = n − Rank(H) = n − r,
  if Rank(X) = r.
- Thus e^T e / (n − r) is an unbiased estimate of σ². We call it the least squares estimate of σ²: that does not mean we apply least squares to get the estimator; rather, the estimator is obtained from the least squares residuals. We denote it as
  σ̂²_LSE = SSE / (n − r).
Maximum Likelihood Estimation
- Under normal errors the likelihood is
  L(β, σ²) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) (y − Xβ)^T (y − Xβ) },
  and hence
  ℓ = ln L(β, σ²) = const − (n/2) ln σ² − (1/(2σ²)) (y − Xβ)^T (y − Xβ).

MLE (Contd.)
- Setting ∂ℓ/∂β = 0 we get
  (1/σ²) X^T (y − Xβ) = 0,
  that is,
  X^T X β = X^T y.
- Hence
  L(β, σ²) ≤ L(β̂, σ²) for all σ² > 0,
  with equality iff β = β̂.
MLE (Contd.)
- Setting ∂ℓ/∂σ² = 0 we get
  (1/σ²) [ (1/σ²) (y − Xβ̂)^T (y − Xβ̂) − n ] = 0,
  which gives
  σ̂² = (1/n) (y − Xβ̂)^T (y − Xβ̂).
- Then
  L(β̂, σ̂²) ≥ L(β̂, σ²) ≥ L(β, σ²).
- Hence
  β̂_MLE = β̂_LSE
  and
  σ̂²_MLE = (1/n) (y − Xβ̂)^T (y − Xβ̂) = SSE/n.
General linear hypothesis
- Consider a linear model
  y = Xβ + ε,
  and suppose we want to test
  H0 : Lβ = c.
- Here
  R0² = SSE = min_β (y − Xβ)^T (y − Xβ)
  and
  R1² = SSE_{H0} = min_{β: Lβ=c} (y − Xβ)^T (y − Xβ).
Testing H0: Motivation 1
- We note that SSH0 = SSE_{H0} − SSE is the change in SSE due to H0, or we can interpret this as the sum of squares due to departure from H0.
- Hence the value of SSH0 serves as a measure for testing H0: the larger the value of SSH0, the more is the departure from H0.
- Hence a large value of SSH0 will indicate rejection of H0: that is, a right-tailed test based on SSH0 will be appropriate for testing H0.
- Here the value of SSE serves as the benchmark for judging SSH0.
- Hence ideally the ratio SSH0/SSE should serve as the test statistic. But this ratio does not have any standard distribution.
- Hence, in order to get a test statistic with a standard distribution, we consider F = MSH0/MSE, which according to the previous discussion will have an F distribution under H0.
Confusion with the word ANOVA

Motivation 2
- It should be intuitive that SSE represents variation due to random causes, which we consider to be allowable.
- If SSH0 comes out significantly larger than SSE, we conclude that the variation caused due to H0 is significant and non-ignorable.
- Hence we shall compute the ratio
  (SSH0 / df1) / (SSE / df2),
  and this should be our test statistic for testing H0.
Motivation 3
- For testing H0 : Lβ = c, a natural statistic is Lβ̂ − c.
- H0 will be rejected if Lβ̂ is sufficiently far away from c.
- However, not every element in Lβ̂ should be treated the same, as they have different precisions.
- A suitable distance measure that incorporates the precision of each component of Lβ̂ is the quadratic form
  (Lβ̂ − c)^T [Var(Lβ̂)]^{−1} (Lβ̂ − c),
  where Var(Lβ̂) = L Var(β̂) L^T = σ² L (X^T X)⁻ L^T.
- If we estimate σ² by its unbiased estimate SSE/(n − r) = σ̂², we arrive at
  (Lβ̂ − c)^T [ L (X^T X)⁻ L^T ]^{−1} (Lβ̂ − c) / σ̂².
- Our test statistic will be a constant times this quadratic measure.
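A sketch (numpy assumed, simulated full-rank example) of the resulting F statistic; general_linear_hypothesis_F is a hypothetical helper mirroring the quadratic form just described.

```python
import numpy as np

def general_linear_hypothesis_F(X, y, L, c):
    """F statistic for H0: L beta = c in the model y = X beta + eps (full-rank X assumed here)."""
    n, p = X.shape
    q = np.linalg.matrix_rank(L)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)                     # SSE / (n - r), with r = p here
    d = L @ beta_hat - c
    quad = d @ np.linalg.inv(L @ XtX_inv @ L.T) @ d
    return (quad / q) / sigma2_hat

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=30)
L = np.array([[0.0, 1.0, -1.0]])                             # H0: beta_1 = beta_2
print(general_linear_hypothesis_F(X, y, L, np.array([0.0])))
```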
F-test for general linear hypothesis
- Consider a linear model y = Xβ + ε with Rank(X) = r, and the hypothesis H0 : Lβ = c with Rank(L) = q.
- Here
  R0² = min_β (y − Xβ)^T (y − Xβ)
  and
  R1² = min_{β: Lβ=c} (y − Xβ)^T (y − Xβ).
- In our new notation (SSE = R0², SSH0 = R1² − R0²) the test statistic can be written as
  F = (SSH0 / q) / (SSE / (n − r)) = MSH0 / MSE ∼ F_{q, n−r} under H0.
Example
- Consider the linear model
  y1 = α1 + ε1
  y2 = 2α1 − α2 + ε2
  y3 = α1 + 2α2 + ε3,
  where ε ∼ N3(0, σ² I3).
- The design matrix is
  X = [ 1   0
        2  −1
        1   2 ],
  which is of rank 2.
- Suppose we want to test H0 : α1 = α2.
- The hypothesis can be written as
  H0 : Lβ = 0,
  where L = (1, −1) and β = (α1, α2)^T, so that Rank(L) = q = 1.
- Now
  X^T X = [ 6  0
            0  5 ]
  and
  X^T y = [ y1 + 2y2 + y3
            −y2 + 2y3    ].
- Thus
  β̂ = ( α̂1, α̂2 )^T = [ 1/6  0 ; 0  1/5 ] ( y1 + 2y2 + y3, −y2 + 2y3 )^T = ( (y1 + 2y2 + y3)/6, (2y3 − y2)/5 )^T.
Method I
- Then SSE = (y − Xβ̂)^T (y − Xβ̂), with n − r = 3 − 2 = 1 degree of freedom.
- Also
  L (X^T X)^{−1} L^T = (1  −1) [ 1/6  0 ; 0  1/5 ] (1, −1)^T = 1/6 + 1/5 = 11/30.
Method II
- We shall find SSE_{H0}, for which we shall require the least squares estimate β̂_H under the restriction α1 = α2.
- Writing α for the common value of α1 and α2, the model under H0 becomes y1 = α + ε1, y2 = α + ε2, y3 = 3α + ε3, and hence
  SSE_{H0} = (y1 − α̂_H)² + (y2 − α̂_H)² + (y3 − 3α̂_H)².
- Then
  F = (SSE_{H0} − SSE) / SSE,
  since here q = 1 and n − r = 1.
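The two methods can be checked against each other numerically; here is a sketch (numpy assumed, with made-up y values).

```python
import numpy as np

y = np.array([1.2, 2.5, 0.7])                      # hypothetical observations
X = np.array([[1.0, 0.0], [2.0, -1.0], [1.0, 2.0]])
L = np.array([1.0, -1.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
SSE = np.sum((y - X @ beta_hat) ** 2)              # n - r = 1 degree of freedom

# Method I: quadratic form in L beta_hat
F1 = (L @ beta_hat) ** 2 / ((L @ np.linalg.inv(X.T @ X) @ L) * SSE)   # 11/30 appears here

# Method II: refit under H0: alpha1 = alpha2 = alpha, i.e. y = (1, 1, 3)^T alpha + eps
x0 = np.array([1.0, 1.0, 3.0])
alpha_H = (x0 @ y) / (x0 @ x0)
SSE_H0 = np.sum((y - x0 * alpha_H) ** 2)
F2 = (SSE_H0 - SSE) / SSE

print(np.isclose(F1, F2))                          # True: both methods give the same F
```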
- Recall the two-sample t statistic
  t = (ȳ1 − ȳ2) / ( s √(1/n1 + 1/n2) ).
- Alternatively, we can write the two-sample problem as the linear model
  z = α + βx + ε,
  or more precisely
  z_i = α + β x_i + ε_i,  i = 1, 2, ..., n1 + n2,
  where x_i is the indicator of the second sample.
- Here
  z_i = y_1i for i = 1, 2, ..., n1, and z_i = y_2(i−n1) for i = n1 + 1, ..., n1 + n2,
  and ε_1, ε_2, ..., ε_n are the random errors.
- If we compare the two formulations, we note that in the case of the linear model we are assuming µ1 = α and µ2 = α + β (together with a common variance σ²), so that H0 : µ1 = µ2 is the same as H0 : β = 0.
- Let us obtain the test statistic for testing the linear hypothesis H0 : β = 0 against H1 : β ≠ 0 in the case of the linear model.
- First we note that when the hypothesis H0 : β = 0 is expressed as a linear hypothesis H0 : Lθ = 0, the choice of L becomes L = ℓ^T = (0, 1).
- Also
  ℓ^T (X^T X)^{−1} ℓ = (0  1) (1/(n1 n2)) [ n2  −n2 ; −n2  n ] (0, 1)^T = n/(n1 n2) = (n1 + n2)/(n1 n2) = 1/n1 + 1/n2.
- Moreover, σ² is estimated by S² = SSE/(n1 + n2 − 2).
- Hence
  F = (ȳ_{10} − ȳ_{20})² / ( S² (1/n1 + 1/n2) ).
- We note that the relation between the two test statistics is F = t², and their corresponding level-α critical regions are
  ℜ1 = { y : |t| > t_{n1+n2−2; α/2} }  and  ℜ2 = { y : F > F_{1, n1+n2−2; α} }.
- Since F_{1, n1+n2−2; α} = t²_{n1+n2−2; α/2}, both critical regions are exactly the same.
- Hence what we find is the equivalence in test statistics between the two approaches and their corresponding tests (critical regions).
- Note that the Fisher t-test approach allows the alternative hypothesis to be one-sided; that is, we can choose the alternative to be H1 : µ1 > µ2 or H1 : µ1 < µ2.
- But the F test in the linear hypothesis approach allows only the two-sided alternative H1 : β ≠ 0.
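A final sketch (numpy and scipy assumed, simulated samples) illustrating the equivalence F = t² numerically.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y1 = rng.normal(loc=0.0, size=8)                      # first sample  (n1 = 8)
y2 = rng.normal(loc=0.5, size=11)                     # second sample (n2 = 11)
n1, n2 = len(y1), len(y2)

# Two-sample t statistic with pooled variance
t_stat, _ = stats.ttest_ind(y1, y2, equal_var=True)

# Linear-model F statistic for H0: beta = 0 in z = alpha + beta*x + eps
z = np.concatenate([y1, y2])
x = np.concatenate([np.zeros(n1), np.ones(n2)])       # indicator of the second sample
X = np.column_stack([np.ones(n1 + n2), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ z)
SSE = np.sum((z - X @ beta_hat) ** 2)
x0 = np.ones(n1 + n2)                                 # restricted model under H0
SSE_H0 = np.sum((z - x0 * (x0 @ z) / (x0 @ x0)) ** 2)
F = (SSE_H0 - SSE) / (SSE / (n1 + n2 - 2))

print(np.isclose(F, t_stat ** 2))                     # True: F = t^2
```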