Lecture Notes on Multivariate Analysis
Preface

We would like to give an idea of some of the special features of this monograph on Multivariate Analysis. Before we go further, let us tell you that it has been prepared according to the new syllabus prescribed by the UGC. We have written these notes in such a simple style that even a weak student will be able to understand them easily.

We are sure you will agree with us that the facts and formulae of multivariate analysis are much the same in all books; the difference lies in presenting these facts to the students in such a simple way that, while going through these notes, a student will feel as if a teacher is sitting by his side and explaining the various points to him. We are sure that after reading these lecture notes, the student will develop a special interest in this field and would like to analyze such data in other disciplines as well.

We think that the real judges of this monograph are the teachers concerned and the students for whom it is meant. So, we request our teacher friends as well as students to point out our mistakes, if any, and to send their comments and suggestions for the further improvement of this monograph. We wish you great success.

CONTENTS

1. Basic concepts of multivariate analysis
2. Multivariate normal distribution
3. Estimation of parameters in the multivariate normal distribution
4. Wishart distribution
5. Multiple and partial correlations
6. Hotelling's T²
7. Discriminant analysis
8. Principal components
9. Canonical correlation and canonical variates
BASIC CONCEPT OF MULTIVARIATE ANALYSIS

Multivariate analysis is the analysis of observations on several correlated random variables for a number of individuals in one or more samples simultaneously. This kind of analysis has been used in almost all scientific studies.

For example, the data may be nutritional anthropometric measurements like height, weight, arm circumference, chest circumference, etc., taken from randomly selected students to assess their nutritional status. Since here we are considering more than one variable, this is called multivariate analysis.

The sample data may be a collection of measurements, such as lengths and breadths of petals and sepals of flowers from two different species, used to identify their group characteristics.

The data may be information such as annual income, savings, assets, number of children and so on, collected from randomly selected families.

As in the univariate case, we shall assume that a random sample of multivariate observations has been collected from an infinite or finite population. Also, the multivariate techniques, like their univariate counterparts, allow one to evaluate hypotheses or results regarding the population by means of sample observations.

For this purpose, we consider the multiple measurements together as a system of measurements. Thus x_{iα} denotes the i-th measurement on the α-th individual. Normally, we denote the number of variables by p and the number of individuals by n. The n measurements on the p variables can be arranged as follows:

Variable/Characteristic    Individuals:  1      2      ...   α      ...   n
X_1                                      x_11   x_12   ...   x_1α   ...   x_1n
X_2                                      x_21   x_22   ...   x_2α   ...   x_2n
...                                      ...    ...    ...   ...    ...   ...
X_i                                      x_i1   x_i2   ...   x_iα   ...   x_in
...                                      ...    ...    ...   ...    ...   ...
X_p                                      x_p1   x_p2   ...   x_pα   ...   x_pn

This can be written in matrix form as

    [ x_11  x_12  ...  x_1α  ...  x_1n ]
    [ x_21  x_22  ...  x_2α  ...  x_2n ]
    [  ...   ...  ...   ...  ...   ... ]  =  (x_{iα}),   i = 1, 2, ..., p,   α = 1, 2, ..., n.
    [ x_i1  x_i2  ...  x_iα  ...  x_in ]
    [  ...   ...  ...   ...  ...   ... ]
    [ x_p1  x_p2  ...  x_pα  ...  x_pn ]

This is also known as the data matrix.

The main aim of most of the multivariate techniques is to summarize a large body of data by means of relatively few parameters, or to reduce the number of variables by employing suitable linear transformations and to choose a very limited number of the resulting linear combinations in some optimum manner, disregarding the remaining linear combinations in the hope that they do not contain much information. Among the multivariate techniques there are two main groups. Some methods are concerned with relationships between variables and are called variable-oriented; the important variable-oriented techniques are total, partial, multiple and canonical correlations, principal component analysis and so on. The other group is concerned with relationships between individuals and is called individual-oriented. In such cases a problem of classification arises: to assign an individual of unknown origin to one of two populations we make use of discriminant analysis.

Basic theory of matrices

• Given a p × p square matrix A = (a_ij), the elements a_11, ..., a_pp form the principal diagonal, or simply the diagonal, of A and are called its diagonal elements, whereas the elements a_1p, a_{2,p-1}, ..., a_p1 constitute the secondary diagonal of A. For i ≠ j, the a_ij are the off-diagonal elements of A.

• A matrix of order p, say A = (a_ij), is said to be symmetric if a_ij = a_ji for all i, j = 1, 2, ..., p, or A = A′.

• It is skew-symmetric if a_ij = −a_ji for all i, j = 1, 2, ..., p, or A = −A′.

• A matrix A = (a_ij) is called a diagonal matrix if a_ij = 0 for all i ≠ j, and we write A = diag(a_11, ..., a_pp).

• A nonzero diagonal matrix whose diagonal elements are all equal is called a scalar matrix, and a scalar matrix whose diagonal elements are all unity is called a unit matrix or an identity matrix. A unit matrix is generally denoted by I, in which case a scalar matrix is of the form cI, where c is some nonzero scalar. It may easily be seen that AI = A, IB = B, A(cI) = cA, (aI)B = aB, provided the products are all defined.

• A square matrix A = (a_ij) is said to be upper triangular if a_ij = 0 for all i > j, lower triangular if a_ij = 0 for all i < j, and triangular if A is either upper or lower triangular.

• A square matrix A is said to be idempotent if A² = A.

• A square matrix A is said to be orthogonal if AA′ = A′A = I.

• Let a_i′ denote the i-th row of a matrix A = (a_ij) of order p and b_j its j-th column; then the (i, j)-th elements of AA′ and A′A are a_i′ a_j and b_i′ b_j respectively. Now if A is orthogonal, we have AA′ = A′A = I, and hence

    a_i′ a_j = b_i′ b_j = 1, whenever i = j,
             = 0,           whenever i ≠ j.

  Thus, both (a_1, ..., a_p) and (b_1, ..., b_p) are orthonormal sets.

• The trace of a matrix A, written 'tr A', is the sum of the diagonal elements of A.

• tr(A + B) = tr A + tr B, whenever A and B are square matrices of the same order.
• tr AB = tr BA, whenever both AB and BA are defined.

• tr ABC = tr BCA = tr CAB, provided all three products are defined.

• A square matrix A is said to be non-singular if there exists a square matrix B of the same order, called the inverse of A, such that AB = BA = I.

• The inverse of a non-singular matrix A is unique, and it is because of this uniqueness that the inverse of A, if it exists, is generally denoted by A^{-1}.

• (A^{-1})^{-1} = A and (A^{-1})′ = (A′)^{-1}, whenever A is non-singular.

• (AB)^{-1} = B^{-1} A^{-1}, whenever A and B are non-singular matrices of the same order.

• The determinant of a diagonal or triangular matrix equals the product of its diagonal elements.

• |A′| = |A|, |AB| = |BA| = |A| |B| = |B| |A|.

• |A| ≠ 0 if A is non-singular, since 1 = |A A^{-1}| = |A| |A^{-1}|.

• |A| = 1 or −1, if A is orthogonal.

• |aA| = a^p |A|, where p is the order of A.

• |A| = 0 whenever A has a zero row or a zero column or two identical rows (columns).

• The determinant changes sign if any two rows (columns) are interchanged.

• The determinant remains unaltered when a row (column) is replaced by the sum of that row (column) and a scalar multiple of another row (column).

• If A = diag(A_11, ..., A_qq), where A_11, ..., A_qq are all square matrices, then |A| = |A_11| ... |A_qq|.

• If A is of the block-triangular form

    [ A_11  A_12  ...  A_1q ]        [ A_11   0    ...   0   ]
    [  0    A_22  ...  A_2q ]   or   [ A_21  A_22  ...   0   ] ,
    [  .     .    ...   .   ]        [  .     .    ...   .   ]
    [  0     0    ...  A_qq ]        [ A_q1  A_q2  ...  A_qq ]

  then |A| = |A_11| ... |A_qq|.

• If A = [ A_11  A_12 ; A_21  A_22 ], where both A_11 and A_22 are non-singular, then

    A^{-1} = [  A_{11.2}^{-1}                  −A_{11}^{-1} A_12 A_{22.1}^{-1} ]
             [ −A_{22}^{-1} A_21 A_{11.2}^{-1}         A_{22.1}^{-1}           ] ,

  where A_{11.2} = A_11 − A_12 A_22^{-1} A_21 and A_{22.1} = A_22 − A_21 A_11^{-1} A_12.

• Given a square matrix A of order p, the matrix (A − λI), where λ is a scalar variable, is called the latent (or characteristic) matrix of A, and the polynomial |A − λI|, which is of degree p, is called the latent (or characteristic) polynomial of A; the equation |A − λI| = 0 is its characteristic equation.

• If we write |A − λI| = a_0 + a_1 λ + ... + a_p λ^p, it can easily be seen that a_0 = |A| and a_p = (−1)^p.

• Every square matrix satisfies its own characteristic equation.

• Given a square matrix A, the roots of its characteristic equation |A − λI| = 0 are called its latent (or characteristic or secular) roots, or eigen (or proper) values. Clearly, if A is p × p, then A has p latent roots.

• The latent roots of a diagonal or triangular matrix are its diagonal elements.

• The determinant of a square matrix is the product of its latent roots.

• A square matrix is non-singular if and only if none of its latent roots is zero.

• Given a latent root λ of a square matrix A, x is called a latent vector of A corresponding to the root λ if (A − λI) x = 0, i.e. Ax = λx.

• If λ is a latent root of a non-singular matrix A, then λ^{-1} is a latent root of A^{-1}.

• If λ is a latent root of an orthogonal matrix A, so is λ^{-1}.

• The latent roots of an idempotent matrix are zeros and unities only.

• The latent roots of a real symmetric matrix are all real.

• Two square matrices A and B of the same order are said to be similar if there exists a non-singular matrix C such that A = C^{-1} B C.

• Similar matrices have the same characteristic polynomials.

• Similar matrices have the same latent roots.

• If the latent roots of A are all distinct, then A is similar to a diagonal matrix whose diagonal elements are the latent roots of A.

• If A is a real symmetric matrix, then there exists an orthogonal matrix C such that C′AC is a diagonal matrix.

• Let A = (a_ij) be a real symmetric matrix of order p and x′ = (x_1, ..., x_p) a p-tuple of real variables. The polynomial ∑_i ∑_j a_ij x_i x_j = x′Ax is called a p-ary quadratic form.

• The quadratic form x′Ax (or equivalently the real symmetric matrix A) is said to be

  i) positive definite, if x′Ax > 0 for all x ≠ 0;

  ii) negative definite, if x′Ax < 0 for all x ≠ 0;

  iii) positive semi-definite or nonnegative definite, if x′Ax ≥ 0 for all x.
• A p-ary quadratic form x′Ax is positive definite (negative definite) if and only if the latent roots of A are all positive (negative).

• If A is positive definite (negative definite), then −A is negative definite (positive definite).

• If A is positive definite, then |A| > 0.
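These facts are easy to verify numerically. The following short Python sketch (an illustration only, assuming NumPy is available; the matrices are generated at random) checks a few of them.

```python
import numpy as np

np.random.seed(0)
p = 4
B = np.random.randn(p, p)
A = B + B.T                      # a real symmetric matrix
lam = np.linalg.eigvalsh(A)      # latent roots of a real symmetric matrix are all real

# trace and determinant in terms of the latent roots
print(np.isclose(np.trace(A), lam.sum()))          # tr A = sum of latent roots
print(np.isclose(np.linalg.det(A), lam.prod()))    # |A| = product of latent roots

# tr(AB) = tr(BA) and tr(A + B) = tr A + tr B
C = np.random.randn(p, p)
print(np.isclose(np.trace(A @ C), np.trace(C @ A)))
print(np.isclose(np.trace(A + C), np.trace(A) + np.trace(C)))

# a positive definite matrix (all latent roots positive) has |A| > 0
P = B @ B.T + p * np.eye(p)
print(np.all(np.linalg.eigvalsh(P) > 0), np.linalg.det(P) > 0)
```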
The conditional density function of Y, given X, for two rv's X and Y is defined as

    f(y | x) = f(x, y) / f(x), provided that f(x) > 0.

In the general case of X_1, X_2, ..., X_p with cdf F(x_1, x_2, ..., x_p), the conditional density function of X_1, X_2, ..., X_r, given X_{r+1} = x_{r+1}, X_{r+2} = x_{r+2}, ..., X_p = x_p, is

    f(x_1, x_2, ..., x_p) / ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} f(u_1, ..., u_r, x_{r+1}, ..., x_p) du_1 ... du_r .

Transformation of variables

Let X_1, X_2, ..., X_p have the joint density function f(x_1, x_2, ..., x_p). Consider p real-valued functions y_i = y_i(x_1, x_2, ..., x_p), i = 1, 2, ..., p. We assume that the transformation from X to Y is one-to-one, so the inverse transformation is x_i = x_i(y_1, y_2, ..., y_p), i = 1, 2, ..., p. Let the random variables Y_1, Y_2, ..., Y_p be defined by

    Y_i = y_i(X_1, X_2, ..., X_p).

Then the joint density function of Y_1, Y_2, ..., Y_p is

    g(y_1, y_2, ..., y_p) = f[ x_1(y_1, y_2, ..., y_p), ..., x_p(y_1, y_2, ..., y_p) ] J,

where J = mod | ∂x_i / ∂y_j |, i, j = 1, 2, ..., p, is the Jacobian of the transformation.

Form of normal density function

1) The univariate normal density function is given by

    f(x) = (1 / (σ √(2π))) exp{ −(1/2) ((x − µ)/σ)² } = (1 / ((2π)^{1/2} |Σ|^{1/2})) exp{ −(1/2) (x − µ)′ Σ^{-1} (x − µ) },

    −∞ < x < ∞, −∞ < µ < ∞, Σ = σ² > 0.

To show that ∫_{-∞}^{∞} f(x) dx = 1, put (x − µ)/σ = y, so that dx = σ dy; then

    ∫_{-∞}^{∞} f(x) dx = (1 / (σ √(2π))) ∫_{-∞}^{∞} e^{−y²/2} σ dy = (2 / √(2π)) ∫_0^{∞} e^{−y²/2} dy.

Again put y²/2 = t, i.e. y² = 2t, so that 2y dy = 2 dt, or dy = dt / y = dt / √(2t) = (1/√2) t^{-1/2} dt; then

    ∫_{-∞}^{∞} f(x) dx = (2 / √(2π)) ∫_0^{∞} e^{−t} (1/√2) t^{-1/2} dt = (1/√π) ∫_0^{∞} e^{−t} t^{1/2 − 1} dt = (1/√π) Γ(1/2) = 1,

since the gamma function is ∫_0^{∞} e^{−α} α^{p−1} dα = Γ(p), and Γ(1/2) = √π.

Moment generating function of the univariate normal distribution

If X ~ N(µ, σ²), then Y = (X − µ)/σ ~ N(0, 1). By definition,

    M_Y(t) = E[e^{tY}] = ∫_{-∞}^{∞} e^{ty} f(y) dy = (1/√(2π)) ∫_{-∞}^{∞} e^{ty} e^{−y²/2} dy = (1/√(2π)) ∫_{-∞}^{∞} e^{−(1/2)(y² − 2ty)} dy

           = (1/√(2π)) ∫_{-∞}^{∞} e^{−(1/2)(y² − 2ty + t² − t²)} dy = e^{t²/2} (1/√(2π)) ∫_{-∞}^{∞} e^{−(1/2)(y − t)²} dy.

Let y − t = z, so that dy = dz; then

    M_Y(t) = e^{t²/2} (1/√(2π)) ∫_{-∞}^{∞} e^{−z²/2} dz = e^{t²/2}.

Now,

    Y = (X − µ)/σ  ⇒  X = µ + σY.

Thus,

    M_X(t) = E[e^{tX}] = E[e^{t(µ + σY)}] = e^{tµ} E[e^{(tσ)Y}] = e^{tµ} M_Y(tσ) = e^{tµ + t²σ²/2}.
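As a quick numerical illustration (a Python sketch assuming NumPy; the values of µ, σ and t are arbitrary), the closed form e^{tµ + t²σ²/2} can be compared with a Monte Carlo estimate of E[e^{tX}]:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, t = 2.0, 1.5, 0.3
x = rng.normal(mu, sigma, size=1_000_000)

mc_mgf = np.exp(t * x).mean()                    # Monte Carlo estimate of E[e^{tX}]
exact  = np.exp(t * mu + 0.5 * t**2 * sigma**2)  # closed form derived above
print(mc_mgf, exact)                             # the two values agree closely
```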
2) The bivariate normal density function is

    f(x_1, x_2) = (1 / (2π σ_1 σ_2 √(1 − ρ²))) exp{ −(1 / (2(1 − ρ²))) [ ((x_1 − µ_1)/σ_1)² − 2ρ ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ] }

or

    f(x) = (1 / ((2π) |Σ|^{1/2})) exp{ −(1/2) (x − µ)′ Σ^{-1} (x − µ) }.

Proof: Let X = (X_1, X_2)′, with corresponding µ = (µ_1, µ_2)′, and

    Σ = [ σ_11  σ_12 ]   [ σ_1²       ρ σ_1 σ_2 ]
        [ σ_21  σ_22 ] = [ ρ σ_1 σ_2     σ_2²   ] ,   where ρ = σ_ij / (σ_i σ_j).

Now

    |Σ| = σ_1² σ_2² − ρ² σ_1² σ_2² = σ_1² σ_2² (1 − ρ²).

Thus,

    Σ^{-1} = (1 / ((1 − ρ²) σ_1² σ_2²)) [  σ_2²        −ρ σ_1 σ_2 ]                    [  1/σ_1²        −ρ/(σ_1 σ_2) ]
                                        [ −ρ σ_1 σ_2     σ_1²     ]  =  (1 / (1 − ρ²)) [ −ρ/(σ_1 σ_2)     1/σ_2²     ] .

Consider

    (x − µ)′ Σ^{-1} (x − µ) = (x_1 − µ_1, x_2 − µ_2) (1 / (1 − ρ²)) [  1/σ_1²        −ρ/(σ_1 σ_2) ] ( x_1 − µ_1 )
                                                                    [ −ρ/(σ_1 σ_2)     1/σ_2²     ] ( x_2 − µ_2 )

    = (1 / (1 − ρ²)) [ (x_1 − µ_1)²/σ_1² − 2ρ (x_1 − µ_1)(x_2 − µ_2)/(σ_1 σ_2) + (x_2 − µ_2)²/σ_2² ]

    = (1 / (1 − ρ²)) [ ((x_1 − µ_1)/σ_1)² − 2ρ ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ].

Therefore,

    f(x) = (1 / ((2π)^{2/2} |Σ|^{1/2})) exp{ −(1/2) (x − µ)′ Σ^{-1} (x − µ) }.

This is the pdf of a bivariate normal distribution with mean vector µ and variance-covariance matrix Σ; thus X ~ N_2(µ, Σ).

Exercise: Find the marginals of X_1 and X_2, if f(x_1, x_2) = (1 / ((2π) σ_1 σ_2 √(1 − ρ²))) e^{−Q/2}, where

    Q = (1 / (1 − ρ²)) [ ((x_1 − µ_1)/σ_1)² − 2ρ ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ].

Solution: For the marginal of X_1, we have

    Q = (1 / (1 − ρ²)) [ ((x_1 − µ_1)/σ_1)² − 2ρ ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ]

      = (1 / (1 − ρ²)) [ ρ² ((x_1 − µ_1)/σ_1)² + (1 − ρ²) ((x_1 − µ_1)/σ_1)² − 2ρ ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ]

      = ((x_1 − µ_1)/σ_1)² + (1 / (1 − ρ²)) [ (x_2 − µ_2)/σ_2 − ρ (x_1 − µ_1)/σ_1 ]²

      = ((x_2 − µ_2*)/σ_2*)² + ((x_1 − µ_1)/σ_1)²,

where µ_2* = µ_2 + ρ (σ_2/σ_1)(x_1 − µ_1), and σ_2*² = σ_2² (1 − ρ²).

Therefore,

    f(x_1, x_2) = (1 / (2π σ_1 σ_2*)) exp{ −(1/2) [ ((x_2 − µ_2*)/σ_2*)² + ((x_1 − µ_1)/σ_1)² ] }

    = [ (1 / (√(2π) σ_1)) exp{ −(1/2) ((x_1 − µ_1)/σ_1)² } ] [ (1 / (√(2π) σ_2*)) exp{ −(1/2) ((x_2 − µ_2*)/σ_2*)² } ]

    = f(x_1) f(x_2 | x_1), since ∫ f(x_1, x_2) dx_2 = f(x_1).

Hence, the marginal distribution of X_1 is N(µ_1, σ_1²), i.e. X_1 ~ N(µ_1, σ_1²).
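A small simulation (a Python sketch assuming NumPy; the parameter values are arbitrary) illustrates this result: the first coordinate of a bivariate normal sample behaves like N(µ_1, σ_1²), whatever the value of ρ.

```python
import numpy as np

rng = np.random.default_rng(2)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.7
mu = np.array([mu1, mu2])
Sigma = np.array([[s1**2,      rho*s1*s2],
                  [rho*s1*s2,  s2**2    ]])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
x1 = x[:, 0]
print(x1.mean(), x1.var())   # close to mu1 = 1 and sigma_1^2 = 4, as the marginal N(mu1, s1^2) predicts
```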
Proceeding in the same way, but completing the square in x_1 instead of x_2,

    f(x_1, x_2) = [ (1 / (√(2π) σ_2)) exp{ −(1/2) ((x_2 − µ_2)/σ_2)² } ] [ (1 / (√(2π) σ_1*)) exp{ −(1/2) ((x_1 − µ_1*)/σ_1*)² } ]

    = f(x_2) f(x_1 | x_2), where ∫ f(x_1, x_2) dx_1 = f(x_2),

with µ_1* = µ_1 + ρ (σ_1/σ_2)(x_2 − µ_2) and σ_1*² = σ_1² (1 − ρ²), so that the marginal distribution of X_2 is N(µ_2, σ_2²).

Moment generating function of the bivariate normal distribution

    M_{X_1 X_2}(t_1, t_2) = E[ e^{t_1 X_1 + t_2 X_2} ] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} e^{t_1 x_1 + t_2 x_2} f(x_1, x_2) dx_1 dx_2.

Integrating first over x_2, using the conditional N(µ_2*, σ_2*²) distribution of X_2 given X_1 = x_1, and then over x_1, we obtain

    M_{X_1 X_2}(t_1, t_2) = exp{ t_1 µ_1 + t_2 µ_2 + (1/2) [ t_2² σ_2² − t_2² ρ² σ_2² + t_1² σ_1² + t_2² ρ² σ_2² + 2 t_1 t_2 ρ σ_1 σ_2 ] }

    = exp{ t_1 µ_1 + t_2 µ_2 + (1/2) ( t_1² σ_1² + 2 t_1 t_2 ρ σ_1 σ_2 + t_2² σ_2² ) }.

Note: If ρ = 0, then M_{X_1 X_2}(t_1, t_2) = M_{X_1 X_2}(t_1, 0) M_{X_1 X_2}(0, t_2), i.e. X_1 and X_2 are independently distributed, since X_1 and X_2 are independent if and only if M(t_1, t_2) = M(t_1, 0) M(0, t_2).
    ... = P(−1 < Z < 1).

    P(3 < Y < 8 | X = 7) = P( (3 − 4)/4 < (Y − µ_2*)/σ_2* < (8 − 4)/4 ) = P(−0.25 < Z < 1)
                         = P(−0.25 < Z < 0) + P(0 < Z < 1) = 0.440,

and 1 − ρ² = 1.44/4 = 0.36 ⇒ ρ² = 1 − 0.36 = 0.64, or ρ = ±0.8.

iii) Given X ~ N(3, 16),

    P(−3 < X < 3) = P( (−3 − 3)/4 < (X − µ_1)/σ_1 < (3 − 3)/4 ) = P(−6/4 < Z < 0) = P(−1.5 < Z < 0) = 0.4332.

Exercise: Find the mean vector µ and the variance-covariance (dispersion) matrix Σ of the following density functions:

i) f(x, y) = (1/(2π)) exp{ −(1/2) [ (x − 1)² + (y − 2)² ] },

ii) f(x, y) = (1/(2.4π)) exp{ −(1/0.72) [ x²/4 − 1.6 xy/2 + y² ] },
iii) f(x, y) = (1/(2π)) exp{ −(1/2) (x² + y² + 4x − 6y + 13) }, and

iv) f(x, y) = (1/(2π)) exp{ −(1/2) (2x² + y² + 2xy − 22x − 14y + 65) }.

Solution: If (X, Y) ~ N_2(µ_1, µ_2, σ_1², σ_2², ρ), then

    f(x, y) = (1 / (2π σ_1 σ_2 √(1 − ρ²))) e^{−Q/2},

where

    Q = (1 / (1 − ρ²)) [ ((x − µ_1)/σ_1)² − 2ρ ((x − µ_1)/σ_1)((y − µ_2)/σ_2) + ((y − µ_2)/σ_2)² ].

i) We have f(x, y) = (1/(2π)) exp{ −(1/2) [ (x − 1)² + (y − 2)² ] },

    ⇒ µ_1 = 1, µ_2 = 2, σ_1 = 1, σ_2 = 1, and ρ = 0.

Alternative Solution

iii) Since Q = x² + y² + 4x − 6y + 13,

    ∂Q/∂x = 2x + 4 = 0 ⇒ x = −2, i.e. µ_1 = −2,
    ∂Q/∂y = 2y − 6 = 0 ⇒ y = 3, i.e. µ_2 = 3, or µ = (−2, 3)′,

and

    Σ^{-1} = [ coeff. of x²          (1/2) coeff. of xy ]   [ 1  0 ]                [ 1  0 ]
             [ (1/2) coeff. of xy    coeff. of y²       ] = [ 0  1 ] , so that Σ =  [ 0  1 ] .

iv) We have f(x, y) = (1/(2π)) exp{ −(1/2) (2x² + y² + 2xy − 22x − 14y + 65) }, where Q = 2x² + y² + 2xy − 22x − 14y + 65,

    ∂Q/∂x = 4x + 2y − 22 = 0, and ∂Q/∂y = 2y + 2x − 14 = 0.
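The alternative solution can be carried out mechanically: the mean is the stationary point of Q, and Σ^{-1} is read off from the second-order coefficients. A short Python sketch for part iv) (an illustration only, assuming NumPy):

```python
import numpy as np

# Q(x, y) = 2x^2 + y^2 + 2xy - 22x - 14y + 65  (part iv above)
# Second-order coefficients give Sigma^{-1}; the stationary point of Q gives mu.
Sigma_inv = np.array([[2.0, 1.0],     # [coeff. of x^2,     (1/2) coeff. of xy]
                      [1.0, 1.0]])    # [(1/2) coeff. of xy, coeff. of y^2    ]
b = np.array([22.0, 14.0]) / 2.0      # dQ/dx = 0, dQ/dy = 0  <=>  Sigma_inv @ mu = (11, 7)

mu = np.linalg.solve(Sigma_inv, b)    # mean vector
Sigma = np.linalg.inv(Sigma_inv)      # dispersion matrix
print(mu)      # [4. 3.]
print(Sigma)   # [[ 1. -1.]
               #  [-1.  2.]]
```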
    ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} k |C| e^{−(1/2) y′y} dy_1 ... dy_p = 1

or

    ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} k |C| e^{−(1/2)(y_1² + y_2² + ... + y_p²)} dy_1 ... dy_p = 1

or

    k |C| ∫_{-∞}^{∞} e^{−y_1²/2} dy_1 ... ∫_{-∞}^{∞} e^{−y_p²/2} dy_p = 1

or

    k |C| √(2π) ... √(2π) = 1, as (1/√(2π)) ∫_{-∞}^{∞} e^{−t²/2} dt = 1, so that k = 1 / (|C| (2π)^{p/2}).   (2.5)

Substituting from (2.5) in (2.4), we get

    g(y) = |C| e^{−(1/2) y′y} / (|C| (2π)^{p/2}) = (1/√(2π)) e^{−y_1²/2} ... (1/√(2π)) e^{−y_p²/2}.

Thus, y_1, y_2, ..., y_p are independently normally distributed, each with mean zero and variance unity. Therefore,

    E(x − b) = E(C y) = C E(y) = 0,

and

    A^{-1} = C C′, as (AC)^{-1} = C^{-1} A^{-1}.   (2.8)

From equations (2.7) and (2.8), we get A = Σ_x^{-1}. Also

    |A^{-1}| = |C C′| = |C| |C′| = |C|², as |C| = |C′|, so that |C| = |Σ_x|^{1/2}.

Hence,

    f(x) = (1 / ((2π)^{p/2} |Σ_x|^{1/2})) exp{ −(1/2) (x − µ_x)′ Σ_x^{-1} (x − µ_x) }.

Definition: A random vector X = (X_1, X_2, ..., X_p)′ taking values x = (x_1, x_2, ..., x_p)′ in E^p (Euclidean space of dimension p) is said to have a p-variate normal distribution if its probability density function can be written as

    f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2) (x − µ)′ Σ^{-1} (x − µ) },

where µ = (µ_1, ..., µ_p)′, and Σ is a positive definite symmetric matrix of order p.
Hence,

    f(x) = (1 / ((2π)^{p/2} ∏_{i=1}^p σ_i)) exp{ −(1/2) ∑_{i=1}^p ((x_i − µ_i)/σ_i)² } = ∏_{i=1}^p (1 / ((2π)^{1/2} σ_i)) exp{ −(1/2) ((x_i − µ_i)/σ_i)² }

         = f(x_1) f(x_2) ... f(x_p).

Therefore, X_1, X_2, ..., X_p are independently normally distributed random variables with means µ_i and variances σ_i².

Theorem: If X (with p components) is distributed according to N(µ, Σ), then Y = C X (a nonsingular transformation) is distributed according to N(Cµ, CΣC′), for C nonsingular.

Therefore,

    Y ~ N(Cµ, CΣC′).

Exercise: Given (X_1, X_2) ~ N_2(µ_1, µ_2, σ_1², σ_2², ρ), find the joint density function of Y_1 = X_1 + X_2 and Y_2 = X_1 − X_2.

Solution: The transformation is

    ( Y_1 )   ( 1   1 ) ( X_1 )
    ( Y_2 ) = ( 1  −1 ) ( X_2 ) ,

and the matrix of transformation C = ( 1 1 ; 1 −1 ) is nonsingular; therefore

    Y ~ N_2(Cµ, CΣC′), where Cµ = ( 1 1 ; 1 −1 ) ( µ_1 ; µ_2 ) = ( µ_1 + µ_2 ; µ_1 − µ_2 ) = µ_y.
Σ 22 = E ( X ( 2) − µ ( 2) ) ( X (2) − µ (2) )′ , Σ12 = E ( X (1) − µ (1) ) ( X ( 2) − µ ( 2) )′ , and Therefore, X (1) and X (2) are independent.
∞ ∞ ∞ (1) Since the transformation is nonsingular, therefore, the distribution Y is also p − variate
∫−∞ ∫−∞ L ∫−∞ f ( x ) f ( x ( 2) ) dx q +1 dx q + 2 L dx p
normal with
∞ ∞ ∞
= f ( x (1) ) ∫ f ( x ( 2) ) dx q+1 dx q + 2 L dx p = f ( x (1) ) (1) −1 (2)
− ∞ −∞ ∫ L∫
−∞
Y (1) I − Σ12 Σ −1 X (1)
E Y = E ( 2) = E 22 = E X − Σ12 Σ 22 X
Y 0 ( 2) ( 2 )
I X X
Hence, the marginal distribution of X (1) is N q ( µ (1) , Σ11 ) , similarly the marginal
µ (1) − Σ Σ −1 µ (2)
distribution of X (2) is N p −q ( µ (2) , Σ 22 ) . = 12 22
µ ( 2)
Theorem: If X is distributed according to N ( µ , Σ) , the marginal distribution of any set of
and
components of X is multivariate normal with means, variances and covariances obtained by
taking corresponding components of µ and Σ , respectively. I − Σ12 Σ −1 Σ11 Σ12 I 0
ΣY = 22 , since Y = C X .
0 Σ 21 Σ 22 − Σ12 Σ −1 I
Proof: We partition X , µ and Σ as I 22
and the corresponding partition of µ and Σ will be The joint density function of Y (1) and Y (2) is
µ (1) 1 1
Σ11 Σ12 exp[ − { y (1) − ( µ (1) − Σ12 Σ −221 µ (2) )}′
µ = , and f ( y) =
Σ = . q/2 1/ 2 2
(2) Σ 21 Σ 22
(2π ) Σ11.2
µ
−1 (1) −1 ( 2 )
We shall make a nonsingular linear transformation to sub vectors Σ11 .2 { y − ( µ (1) − Σ12 Σ 22 µ )}]
−1
Σ11 (1) −1 ( 2 )
− µ (1) − Σ12 Σ 22 ( x − µ (2) )}] 1.0 0.8 − 0.4
.2 { x Σ11.2 = Σ11 − Σ12 Σ −221 Σ 21 = − (1) (− 0.4 − 0.56)
0.8 1.0 − 0.56
= f ( x (1) | x ( 2) ) .
0.8400 0.5760
= .
Therefore, the density function f (x (1)
|x ( 2)
) is a q − variate normal with mean 0.5760 0.6864
−1 ( 2)
µ (1) + Σ12 Σ 22 ( x − µ (2) ) = µ (1) * , and covariance matrix Σ11.2 = Σ11 − Σ12 Σ −221 Σ 21 , i.e. ii) For X 1 , X 3 given X 2 , we partition X as
b) The marginal distribution of X (1) ~ N q (µ (1) , Σ11 ) . The conditional distribution of X 1 , X 3 given X 2 is N 2 ( µ (1) * , Σ11.2 ) ,
where
c) The conditional distribution of X (2) given X (1) = x (1) is normal with mean
0. 8 0. 8 x 2
−1 (1)
µ ( 2) + Σ 21Σ11 −1
( x − µ (1) ) = µ ( 2) * and covariance matrix Σ 22.1 = Σ 22 − Σ 21Σ11 Σ12 . µ (1) * = µ (1) + Σ12 Σ −221 ( x (2) − µ (2) ) = 0 + (1) ( x 2 − 0) =
− 0.56 − 0.56 x 2
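These formulas are easy to evaluate numerically. The sketch below (Python, assuming NumPy; the matrix is the 3 × 3 dispersion matrix used in the worked example nearby, with µ = 0) computes the parameters of the conditional distribution of X^(2) = (X_2, X_3) given X^(1) = X_1 = x_1:

```python
import numpy as np

# Sigma from the worked example; X^(1) = X1, X^(2) = (X2, X3), mu = 0.
Sigma = np.array([[ 1.00,  0.80, -0.40],
                  [ 0.80,  1.00, -0.56],
                  [-0.40, -0.56,  1.00]])
S11 = Sigma[:1, :1]; S12 = Sigma[:1, 1:]
S21 = Sigma[1:, :1]; S22 = Sigma[1:, 1:]

x1 = 1.5                                                   # an arbitrary conditioning value
cond_mean = S21 @ np.linalg.inv(S11) @ np.array([[x1]])    # mu^(2)* = Sigma21 Sigma11^{-1} x1
cond_cov  = S22 - S21 @ np.linalg.inv(S11) @ S12           # Sigma_{22.1}
print(cond_mean.ravel())   # [ 0.8*x1, -0.4*x1 ]
print(cond_cov)            # [[ 0.36 -0.24]
                           #  [-0.24  0.84]]
```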
and Note: This result has been proved when D is p × p and nonsingular.
−1 1.0 − 0.4 0.8 Proof: The transformation is Y = D X , where Y has q components and D is q × p real
Σ11.2 = Σ11 − Σ12 Σ 22 Σ 21 = − (1) (0.8 − 0.56)
− 0.4 1.0 − 0.56 matrix. The expected value of Y is
1.0 − 0.4 0.6400 − 0.4480 0.36 0.048 E (Y ) = D E ( X ) = D µ , and the covariance matrix is
= − = .
− 0.4 1.0 − 0.4480 0.3136 0.048 0.6864
Σ y = E [Y − E (Y )] [Y − E (Y )]′ = E [ D X − D µ ] [ D X − D µ ]′ = D Σ D ′ .
iii) For X 2 , X 3 given X 1 we partition X as
Since R( D ) = q the q rows of D are independent. We know that a set of q independent
X1
vectors can be extended to form a basis of the p − dimensional vector space by adding to it
L X
(1)
X p − q vectors.
X = = , where X (2) = 2 , and X (1) = X 1
X2 X ( 2) X3
Dq× p
X
3 Let C = , then C is nonsingular. Make the transformation
E ( p −q )× p
1.0 M 0.8 − 0.4
Z = C X . Since C nonsingular, therefore, Z ~ N p (C µ , C Σ C ′) , i.e.
L L L L Σ11 Σ12
Σ= = .
0.8 M 1.0 − 0.56 Σ 21 Σ 22 D DX
− 0.4 M − 0.56 1.0 Z = X = .
E EX
The conditional distribution of X 2 , X 3 given X 1 is N 2 (µ ( 2) * , Σ 22.1 ) , where But, D X being the partition vector of Z has a marginal q − variate normal distribution,
therefore,
−1 (1) 0. 8 0.8 x1
µ ( 2) * = µ (2) + Σ 21Σ11 ( x − µ (1) ) = 0 + (1) ( x1 − 0) = , and Y = D X ~ N q ( D µ , D Σ D ′) .
− 0 . 4 − 0.4 x1
1.0 − 0.56 0.8 Note: This theorem tells us if X ~ N p ( µ , Σ) , then every linear combination of the
−1
Σ 22.1 = Σ 22 − Σ 21Σ11 Σ12 = − (1) (0.8 0.4)
− 0.56 1.0 − 0.4 components of X has a univariate normal distribution.
0.36 − 0.24 X1
= .
− 0.24 0.84 X2 ′
Proof: Let Y = D1× p X = (l1 , l 2 , ..., l p ) = l X , then E (Y ) = l ′ µ , and Σ y = l ′ Σ l ,
M
Exercise: Let X_1 and X_2 be jointly distributed as normal with E(X_i) = 0 and
X p
Var. ( X i ) = 1 , for i = 1, 2 . If the distribution of X 2 given X 1 is normal N1 (0, 1 − ρ 2 ) with
therefore,
ρ < 1 , find the covariance matrix of X 1 and X 2 .
Y = D1× p X = l ′ X ~ N1 (l ′ µ , l ′ Σ l ) .
X µ 0 Σ Σ12 1 Σ12
Solution: Given X = 1 , µ = 1 = , and Σ = 11 =
X2 µ
2 0 Σ 21 Σ 22 Σ 21 1 Exercise: Let X ~ N ( µ , Σ) , where X ′ = ( X 1 , X 2 , X 3 ) , µ ′ = (0, 0, 0) , and
1 − 2 0 1 / 2 0 3 − l1 + 3 − 2 l2
1 / 2 1 / 2 0 1 / 2 0 = .
DΣD ′ = − 2 5 0 1 / 2 0 = . −
1 l + 3 − 2 l2 l12 + 2 l1l2 − 2 l1 − 4 l2 + 3 + 2 l22
0 0 1 0 2
0 0 2 0 1
X
The off-diagonal term − l1 + 3 − 2 l 2 is the covariance for X 2 and X 2 − l ′ 1 .
X + X2 X3
We see that 1 and X 3 have covariance σ 12 = 0 , they are independent.
2
l1
v) The transformation is For independent − l1 + 3 − 2 l 2 = 0 , so that l = l1 + 3 .
−
X1 2
0 1 0 X2
Y = X 2 = , i.e. Y = D X
− 5 / 2 1 − 1 X 2 − (5 / 2) X 1 − X 3 Exercise: X 1 and X 2 are two rv' s with respective expectations µ1 and µ 2 and variances
X
3 σ 11 and σ 22 and correlation ρ such that
Therefore,
X 1 = µ1 + σ 11 Y1 , and X 2 = µ 2 + σ 22 (1 − ρ 2 ) Y1 + σ 22 Y2 ,
− 3
0 1 0 1 where Y1 and Y2 are two independent normal N (0,1) variates. Find the joint distribution of
Y ~ N 2 ( D µ , DΣD ′) , where D µ = 1 =
− 5 / 2 1 − 1 9 / 2 X 1 and X 2 .
4
Solution: The transformation is
and
1 − 2 0 0 − 5 / 2 X 1 − µ1 σ 11 0 Y1
i.e. X − µ = CY or X = CY + µ , so that
0 1 0 5 10 =
X 2 − µ 2 σ 22 (1 − ρ ) σ 22 Y2
2
′
DΣ D =
− 2 5 0 1 1 = .
− 5 / 2 1 − 1 10 93 / 4
0 0 2 0 −1
µ1
E X = C E (Y ) + µ = µ =
We see that X 2
5
and X 2 − X 1 − X 3 have covariance σ 12 = 10 , they are not µ2
2
independent. and
iii) Conditional distribution of X 3 given X 1 = x1 and X 2 = x 2 . Exercise: Let A = Σ −1 and Σ be partitioned similarly as
Solution: Consider the transformation A11 A12 Σ Σ12
A = Σ = 11 , by solving A Σ = I according to partitioning prove
X1 A21 A22 Σ 21 Σ 22
0 1 0 X2 ( 2) X 1
Y = X 2 = ′ X (2) , where X = , −1 −1
−
1 l 1 − l 2 2 X − l X3 i) Σ12 Σ 22 = − A11 A12
X3
i.e. Y 2×1 = D2×3 X 3×1 . ii) Σ11 − Σ12 Σ −221 Σ 21 = A11
−1
.
A Σ + A12 Σ 21
or 11 11
A11 Σ12 + A12 Σ 22 I 0
= . = E (X X ′ ) − µ µ′ − µ µ′ + µ µ′
A21 Σ11 + A22 Σ 21 A21 Σ12 + A22 Σ 22 0 I
⇒ E (X X ′) = Σ + µ µ′ .
Consider
A11 Σ12 + A12 Σ 22 = 0 , then A11 Σ12 = − A12 Σ 22 ii) E ( X ′ A X ) = E [tr X ′ A X ] , since tr a = a , a is a scalar.
−1
or Σ12 = − A11 A12 Σ 22 or Σ12 Σ −221 = − A11
−1
A12 . = E [tr A ( X X ′ )] , since tr (C D) = tr ( D C )
Also = tr A E ( X X ′ ) = tr A (Σ + µ µ ′ )
A11 Σ11 + A12 Σ 21 = I , then
= tr AΣ + tr A µ µ ′ = tr AΣ + µ ′ A µ .
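A Monte Carlo check of E(X′AX) = tr(AΣ) + µ′Aµ (a Python sketch assuming NumPy, with arbitrarily chosen µ, Σ and symmetric A):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
mu = np.array([1.0, -1.0, 0.5])
L = rng.standard_normal((p, p))
Sigma = L @ L.T + np.eye(p)          # a positive definite covariance matrix
M = rng.standard_normal((p, p))
A = (M + M.T) / 2                    # a real symmetric matrix

x = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.einsum('ni,ij,nj->n', x, A, x).mean()     # Monte Carlo estimate of E(X'AX)
exact = np.trace(A @ Sigma) + mu @ A @ mu
print(mc, exact)                                   # the two agree closely
```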
−1 −1
A11 Σ11 = I − A12 Σ 21 or Σ11 = A11 − A11 A12 Σ 21 Exercise: Let X ~ N p (0, I ) and if A and B are real symmetric matrices of order p , then
−1 −1
or A11 = Σ11 + A11 A12 Σ 21 i) E ( X ′ A X ) = tr A
ii) V ( X ′ A X ) = 2 tr A 2
−1
or A11 = Σ11 − Σ12 Σ −221 Σ 21 .
Σ − Σ12 Σ −1 Σ 21 Now
0 Σ11.2 0
C Σ C ′ = 11 22 = .
0 Σ 22 0 Σ 22 i)
′
X ′ A X = Y ′ C −1 AC −1Y = Y ′ ∆ Y , since C being orthogonal, C −1 = C ′ .
Therefore, = λ1Y12 + λ 2Y22 + L λ p Y p2 .
C Σ C ′ = Σ11.2 Σ 22 ⇒ C Σ C ′ = Σ11.2 Σ 22 , because C = 1 .
Therefore,
Similarly, we can prove
E ( X ′ A X ) = λ1 E (Y12 ) + λ 2 E (Y22 ) + L λ p E (Y p2 )
−1
Σ = Σ11 Σ 22 − Σ 21 Σ11 Σ12 = Σ11 Σ 22.1 .
= λ1 + λ 2 + L λ p , since Yi2 ~ χ12 for each i
Exercise: Let X ~ N p ( µ , Σ) , let A be a symmetric matrix of order p , show that
= tr C ′AC = tr ACC ′ = tr A .
i) E ( X X ′ ) = Σ + µ µ ′ . ii) V ( X ′ A X ) = λ12 V (Y12 ) + λ 22 V (Y22 ) + L λ 2p V (Y p2 )
iii) Since A and B are real symmetric matrices of order p , so that, A + B is real symmetric Now the characteristic function of Y is
matrix of order p . Hence, ′ i (u1Y1 + L+u pY p ) i u pY p
φY (u ) = E [e iu Y ] = E e = E e i u1Y1 L E e (2.9)
2 tr ( A + B) = V [ X ′ ( A + B) X ] = V ( X ′ A X + X ′ B X )
2
Since Y1 , Y2 , L , Y p are independent and distributed according to N (0, 1) ,
= V ( X ′ A X ) + V ( X ′ B X ) + 2 Cov ( X ′ A X , X ′ B X )
1
1 1 itµ − t 2σ 2
1 φY (u ) = exp − u12 L exp − u 2p , since X ~ N (µ , σ 2 ) , and φ X (t ) = e
⇒ Cov ( X ′ A X , X ′ B X ) = [2 tr ( A + B ) 2 − V ( X ′ A X ) − V ( X ′ B X )] 2 .
2 2 2
1 1
= tr ( A + B) 2 − tr A 2 − tr B 2 − (u12 + L +u 2p ) − u′ u
=e 2 =e 2 (2.10)
= tr ( A 2 + B 2 + AB + BA) − tr A 2 − tr B 2 Thus
= tr AB + tr BA = 2 tr AB , since tr AB = tr BA . ′X i t ′ (C Y + µ ) it ′ µ ′
φ X (t ) = E [e i t ] = E [e ]=e E [e i t ( C Y ) ]
Characteristic function 1
it ′ µ it ′ µ − (C ′ t ) ′ ( C ′ t )
The characteristic function of a random vector X is defined as =e E e i (C ′ t ) ′ Y = e e 2 from equation (2.9) and (2.10)
′X 1 1
φ X (t ) = E [ e i t ] , where t is a vector of reals, i = − 1 . it ′ µ − t ′CC ′ t i t ′ µ − t ′Σ t
=e e 2 =e 2 .
Theorem: Let X = ( X 1 , X 2 , L , X p )′ be normally distributed random vector with mean µ Corollary: Let X ~ N p (µ , Σ ) , if Z = D X , then the characteristic function of vector Z is
and positive definite covariance matrix Σ , then the characteristic function of X is given by 1
i t ′ ( D µ )− t ′ ( DΣD′) t
1 e 2 .
i t ′ µ − t ′Σ t
φX (t ) = e 2 , where t = (t1 , t 2 , L , t p )′ is a real vector of order p × 1 . Proof: The characteristic function of vector Z is defined as
Proof: We have 1
′Z ′ i ( D′ t )′ µ − ( D′t )′ Σ D′ t
1 1 φ Z (t ) = E e it = E e it D X = E e i ( D′t )′ X = e 2
f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ )
1/ 2 2
(2π ) p / 2 Σ 1
i t ′ ( D µ )− t ′ ( D Σ D′) t
=e 2 ,
Since Σ is symmetric and positive definite, there exists a non-singular matrix C such that
which is the characteristic function of N p ( D µ , DΣD ) .
C ′ Σ −1 C = I , and Σ = C C ′ .
Theorem: If every linear combination of the components of a vector X is normally
Let X − µ = C Y , so that Y = C −1 ( X − µ ) a nonsingular transformation and the Jacobian of
distributed, then X has normal distribution.
the transformation is J = C , therefore, the density function of Y is
Proof: Consider a vector X of p − components with density function f (x ) and
1 1 iu′ X
f ( y) = exp − (C y + µ − µ ) ′ Σ −1 (C y + µ − µ ) C characteristic function φ X (u ) = E [e ] and suppose the mean of X is µ and the
p/2 1/ 2 2
(2π ) Σ
covariance matrix is Σ .
1 1
exp − ( y ′ C ′ Σ −1C y ) , since C = Σ Since u ′ X is normally distributed for every u . Then the characteristic function of u ′ X is
1/ 2
=
(2π ) p/2 2
1
′ it u′ µ − t 2 u′Σ u
1 1 1 1 1 1 E e it (u X ) = e 2 , taking t = 1 , this reduces to
= exp − y ′ y = exp − y12 L exp − y 2p .
(2π ) p/2 2
(2π )1 / 2 2 1 / 2 2
(2π ) i u′ µ − u ′Σ u
1
′
It shows that Y1 , Y2 , L , Y p are independently normally distributed each with mean zero and E e iu X = e 2 .
variance one. Therefore, X ~ N p ( µ , Σ) .
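The characteristic function derived above, φ_X(t) = exp(i t′µ − ½ t′Σt), can be checked against a simulation (a Python sketch assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
t = np.array([0.3, -0.2])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
empirical = np.exp(1j * x @ t).mean()                      # estimate of E[exp(i t'X)]
theoretical = np.exp(1j * t @ mu - 0.5 * t @ Sigma @ t)    # exp(i t'mu - t'Sigma t / 2)
print(empirical, theoretical)
```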
Theorem: The moment generating function of a vector X , which is distributed according Let X − µ = C Y a nonsingular transformation. Under this transformation the quadratic form
1
t ′ µ + t ′Σ t expanded as
to N p ( µ , Σ) is M X (t ) = e 2 .
Q = ( x − µ )′ Σ −1 ( x − µ ) = (C y )′ Σ −1 (C y ) = y ′ C ′ Σ −1C y = y ′ y
Proof: Since Σ is symmetric and positive definite, there exists a non-singular matrix
C such that and the Jacobian of the transformation is J = C .
−1
C′Σ C = I and Σ = C C ′ . Therefore, the density function of Y is
Make the nonsingular transformation
1 1
exp − y ′ y C , since Σ = C C ′ = C , and C = Σ
2 1/ 2
f ( y) = .
X − µ = C Y , then Y = C −1 ( X − µ ) and J = C . (2π ) p/2
Σ
1/ 2 2
Therefore, the density function of Y is
1 1 p 1 1 1 1
1 1 = exp − ∑ yi2 = exp − y12 L exp − y 2p .
f ( y) = exp − (C y + µ − µ ) ′ Σ −1 (C y + µ − µ ) C (2π ) p/2 2 1 / 2 2 1 / 2 2
1/ 2 i =1 (2π ) (2π )
(2π ) p/2
Σ 2
This shows that Y_1, Y_2, ..., Y_p are independent standard normal variates, i.e.
1 1
exp − ( y ′ C ′ Σ −1C y ) , since C = Σ
1/ 2
= p
(2π ) p / 2 2 Y ~ N p (0, I ) , then Yi ~ N (0, 1) . Therefore, Q = ∑ y i2 ~ χ 2p .
i =1
1 1
= exp − y ′ y .
(2π ) p/2 2 For r − th mean
p Q p Q
It shows that Y1 , Y2 , L , Y p are independently normally distributed each with mean zero and 1 ∞ −1 − 1 ∞ + r −1 −
variance one.
E (Q r ) = ∫
2 p / 2 Γ ( p / 2) 0
Qr Q 2 e 2 dQ = ∫ Q2
2 p / 2 Γ ( p / 2) 0
e 2 dQ
Y (1) A
= X , and, E (Q 2 ) = 2 2 p + 1 p = p 2 + 2 p .
1 1 1
t′ µ (C′ t )′ (C′ t ) t ′ µ + t ′CC ′ t t ′ µ + t ′Σ t
=e e2 =e 2 =e 2 . Y ( 2) B 2 2
Exercise: If X ~ N p ( µ , Σ) , then obtain the distribution of quadratic form
Therefore,
Q = ( x − µ ) ′ Σ −1 ( x − µ ) . Also find its r − th mean.
V (Q ) = E (Q 2 ) − [ E (Q )]2 = p 2 + 2 p − p 2 = 2 p .
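The χ²_p behaviour of Q, together with E(Q) = p and V(Q) = 2p, can be seen in a short simulation (Python, assuming NumPy and SciPy; the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
p = 4
mu = np.zeros(p)
L = rng.standard_normal((p, p))
Sigma = L @ L.T + np.eye(p)
Sigma_inv = np.linalg.inv(Sigma)

x = rng.multivariate_normal(mu, Sigma, size=100_000)
d = x - mu
Q = np.einsum('ni,ij,nj->n', d, Sigma_inv, d)     # (x - mu)' Sigma^{-1} (x - mu)

print(Q.mean(), Q.var())                          # close to p = 4 and 2p = 8
print(stats.kstest(Q, 'chi2', args=(p,)).pvalue)  # typically large, consistent with a chi-square(p) fit
```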
Solution: Given X ~ N p ( µ , Σ) , then the probability density function of X is
Alternative method
1 1
f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ ) Given X ~ N p ( µ , Σ) , then the probability density function of X is
p/2 1/ 2 2
(2π ) Σ
1 1
Since Σ is symmetric and positive definite, there exists a non-singular matrix C such that f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ ) .
1/ 2 2
C ′ Σ −1 C = I and Σ = C C ′ . (2π ) p / 2 Σ
Let ′
= (C −1Y − C −1C µ ) ′Σ −1 (C −1Y − C −1C µ ) = (Y − C µ )′ C −1 Σ −1C −1 (Y − C µ )
Y = (X − µ )′Σ −1 ( X − µ ) , and its moment generating function
′
Y (1) A Y (1) A
1 −1 = ( 2) − µ (CΣC ′) −1 (2) − µ
1 t ( x − µ )′ Σ −1 ( x − µ ) − 2 ( x −µ )′ Σ ( x −µ ) Y B
Y B
M Y (t ) = E (e t Y ) =
(2π ) p/2
Σ
1/ 2 ∫ e e dx
−1
AΣA′ 0 (Y (1) − A µ )
= [(Y (1) − Aµ )′ (Y ( 2) − B µ )′] (2) .
1
1
− ( x − µ )′ (1−2t ) Σ −1 ( x −µ ) 0 BΣB ′ (Y − B µ )
=
(2π ) p/2
Σ
1/ 2 ∫ e 2 dx
Under the condition A X = 0 , we have Y (1) = 0 , and hence E Y (1) = 0 , we get
1/ 2 −1
(1 − 2t ) Σ −1 1
− ( x − µ )′ (1−2t ) Σ −1 ( x − µ ) AΣA′ 0 0
Q = [0 (Y (2) − B µ ) ′] (Y ( 2) − B µ )
=
1/ 2 1/ 2 ∫e 2 dx
0 BΣ B ′
(2π ) p / 2 Σ (1 − 2t ) Σ −1
= (Y ( 2) − B µ )′ ( BΣB ′) −1 (Y (2) − B µ ) and hence,
1 1
= =
1/ 2 1/ 2 1/ 2 1/ 2 Q = ( X − µ ) ′ Σ −1 ( X − µ ) ~ χ 2p− q , since Y (2) ~ N p −q ( B µ , BΣB ′) .
Σ (1 − 2t ) Σ −1 (1 − 2t ) p / 2 Σ Σ −1
Exercise: If X ~ N p ( µ , Σ) , find the distribution of quadratic form Q = X ′ Σ −1 X .
1
= .
(1 − 2t ) p / 2 Solution: Since Σ is a symmetric and positive definite there exits a non-singular matrix C
such that
But this is the moment generating function of χ 2p , therefore, Y ~ χ 2p .
C ′ Σ −1 C = I and Σ = C C ′ .
Exercise: Suppose X ~ N p ( µ , Σ) and let A be some Y ~ N p (υ , I ) matrix of rank q . If
Let X = C Y a nonsingular transformation, then Y = C −1 X .
A X = 0 , then the quadratic form Q = ( X − µ )′ Σ −1 ( X − µ ) follows a chi-square distribution
Since the transformation is nonsingular, the distribution of Y is also p − variate normal with
on Yi ~ N1 (υ i , 1) degree of freedom.
E (Y ) = C −1 E ( X ) = C −1 µ = υ
Solution: Since Σ is positive definite and A is of full row rank, there exists a ( p − q ) × p
A and
matrix B of rank p − q such that AΣB ′ = 0 and that C = is nonsingular.
B ′ ′
ΣY = E [Y − E (Y )] [Y − E (Y )]′ = C −1 E ( X − µ ) ( X − µ ) ′ C −1 = C −1 Σ C −1 = I
Consider the nonsingular linear transformation
This shows that Y ~ N p (υ , I ) , and Yi ~ N1 (υ i , 1) .
Y = C X , therefore, Y ~ N p (C µ , CΣC ′) .
Therefore,
Now
Q = X ′ Σ −1 X = (C Y ) ′ Σ −1 (C Y ) = Y ′ (C ′ Σ −1C ) Y = Y ′ Y
A A X Y (1)
Y = X = = ( 2) , say
p p
B B X Y = ∑ Yi2 ~ χ 2 , where, ∑υi2 = υ ′ υ = µ ′ Σ −1 µ = non-centrality parameter.
p, ∑ υ i2
i =1 i
i =1
A A AΣA′ AΣB ′ AΣA′ 0
C µ = µ , and CΣC ′ = Σ ( A′ B ′ ) = = .
B B BΣA′ BΣB ′ 0 BΣB ′ Alternative method
This means that Y (1) ~ N q ( Aµ , AΣA′) and Y (2) ~ N p −q ( B µ , BΣB ′) independently. Given X ~ N p ( µ , Σ) , then the probability density function of X is
Therefore, 1 1
f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ ) .
p/2 1/ 2 2
Q = ( X − µ )′ Σ −1 ( X − µ ) = (C −1Y − µ )′Σ −1 (C −1Y − µ ) (2π ) Σ
Let 1
Exercise: If X ~ N p ( µ , Σ) . Show that M X − µ (t ) = exp t ′ Σ t , hence find the moment
Y = X ′ Σ −1 X , and its moment generating function 2
generating function of X .
1
− ( x − µ ) ′ Σ −1 ( x − µ )
1 t x ′ Σ −1 x
M Y (t ) = E (e t Y ) =
(2π ) p / 2 Σ
1/ 2 ∫ e e 2 dx Solution: By definition
1
−1
t′ ( X − µ ) 1 t ′ ( x − µ ) − 2 ( x −µ )′ Σ ( x −µ )
= 1/ 2 ∫
1
t x′Σ −1 x − ( x − µ )′ Σ −1 ( x − µ ) M ( X − µ ) (t ) = E e e e dx
1 (2π ) p / 2 Σ
=
(2π ) p/2
Σ
1/ 2 ∫ e 2 dx.
1
1 − [ ( x − µ )′ Σ −1 ( x − µ )−2 t ′ ( x − µ ) ]
Consider =
(2π ) p / 2 Σ
1/ 2 ∫e 2 dx
1
Q = x ′ Σ −1 x − ( x − µ )′Σ −1 ( x − µ )
2 We know that for A symmetric matrix
1
= − [ x ′ Σ −1 x − µ ′ Σ −1 x − x ′ Σ −1 µ + µ ′ Σ −1 µ − 2 t x ′ Σ −1 x] a ′ A −1 a − 2 a ′ t = (a − At )′ A−1 (a − At ) − t ′ A t
2
1
= − [ x ′ (1 − 2 t )Σ −1 x − 2 x ′ Σ −1 µ + θ ] , where θ = µ ′ Σ −1 µ . Comparing a ′ A −1 a − 2 a ′ u with ( x − µ )′ Σ −1 ( x − µ ) − 2 t ′ ( x − µ ) , we get
2
We know that for A symmetric matrix a = ( x − µ ) , A = Σ , u = t , since a ′ u = u ′ a , so that
a ′ A −1 a − 2 a ′ u = (a − A u )′ A −1 (a − A u ) − u ′ A u 1
1 − [( x − µ −Σ t )′ Σ −1 ( x −µ −Σ t )−t ′Σ t ]
= a ′ A −1 a − u ′ AA−1 a − a ′ A−1 A u + u ′ AA−1 A u − u ′ A u M ( X −µ ) (t ) =
(2π ) p / 2 Σ
1/ 2 ∫e 2 dx
= a ′ A −1 a − 2 a ′ u .
1 ′
t Σt 1 1 ′
− ( x − µ * )′ Σ −1 ( x −µ * )
Comparing a ′ A −1 a − 2 a ′ u = (a − A u )′ A −1 (a − A u ) − u ′ A u with Q , we get e2 t Σt
1/ 2 ∫
= e 2 dx = e2 .
1 (2π ) p / 2 Σ
a = x , A −1 = (1 − 2 t ) Σ −1 , ⇒ A = Σ , u = Σ −1 µ
1− 2t
Also,
Thus,
t′ (X −µ) −t ′ µ ′ −t ′ µ
M ( X − µ ) (t ) = E e =e E ( et X ) = e M X (t )
1
′
1 1 Σ
Q = − x − Σ Σ −1 µ (1 − 2 t )Σ −1 x − Σ Σ −1 µ − µ ′ Σ −1 Σ −1 µ + θ 1
2 (1 − 2 t ) (1 − 2 t ) (1 − 2 t ) t′ µ t ′ µ + t ′Σ t
⇒ M X (t ) = e M ( X − µ ) (t ) = e 2 .
′
1 µ µ θ
= − x − (1 − 2 t )Σ −1 x −
−
+θ . Exercise: For any vector t and any positive definite symmetric matrix W , show that
2 (1 − 2 t ) (1 − 2 t ) (1 − 2 t ) 1 ′
1 ′ −1 1/ 2 t W t
∞ ′
∫−∞ exp − 2 x W x + t x d x = (2π ) W e 2 . Hence find the moment generating
p/2
Therefore,
′ function of p − variate normal distribution with positive definite covariance matrix Σ .
1/ 2 1 µ µ
e tθ /(1− 2 t ) (1 − 2t ) Σ −1 − x − (1− 2t ) Σ −1 x −
2 1− 2 t
1− 2 t
M Y (t ) =
p/2 1/ 2 −1 1 / 2
∫ e dx Solution: Consider the quadratic form
(2π ) Σ (1 − 2t ) Σ 1 ′ −1
− ( x W x − 2 t ′ x ) = Q (say). We know that for A symmetric matrix
2
1
= e (tθ ) /(1− 2 t ) .
(1 − 2t ) p / 2 a ′ A −1 a − 2 a ′ u = (a − A u )′ A −1 (a − A u ) − u ′ A u .
But this is the mgf of χ (2p, θ ) , ⇒ Y ~ χ (2p, θ ) . Comparing a ′ A −1 a − 2 a ′ u with Q , we get a = x , A = W , u = t , since u ′ t = t ′ u .
Thus, µ1
1 1 1 1
1 1 1 I − I I − I µ 2 0
Q = − [ ( x − W t ) ′ W −1 ( x − W t ) − t ′W t ] = − ( x − W t )′ W −1 ( x − W t ) + t ′W t EU = D E X = 4 4 4 4
µ = 0
2 2 2 1 I 1
I
1
− I
1
− I 3
4 4 4 4 4 p×4 p µ
Therefore, 4
1 ∞ 1 1 and
∫ exp − ( x − W t )′ W −1 ( x − W t ) exp t ′W t d x
1/ 2 −∞ 2
(2π ) p / 2 W 2
1 1
I I
1 1 Σ 0 0 0 41 4
exp t ′W t 1 1 1 1
∞ exp − 1 ( x − W t ) ′ W −1 ( x − W t ) d x I − I I − I 0− 4 I I
2 4 0 Σ 0 4
1/ 2 ∫−∞
= 2 ΣU = D Σ Z D ′ = 4 4 4
1 I 1 1 1 0 1 I 1
(2π ) p / 2 W I − I − I 0 0 Σ − I
4 4 4 4 0 0 0 Σ 4 4
1 ′
tWt
1 ′
t Σt − 1 I 1
− I
=e 2 =e2 , since Σ is positive definite symmetric matrix 4 4
1
−t ′ µ t′ µ t ′ µ + t ′Σ t 1 1
= M ( X − µ ) (t ) = e M X (t ) or M X (t ) = e M ( X − µ ) (t ) = e 2 . ( 4 Σ) 0 Σ 0
= 16 =4 .
0 1 1
Exercise: Let X ~ N p ( µ , Σ) and Y ~ N p (δ , Ω) , if U 1 = X + Y , and U 2 = X − Y , find the (4 Σ) 0 Σ
16 4
joint distribution of U 1 , and U 2 .
This shows that U 1 and U 2 are independent and U 1 ~ N p (0 , Σ / 4) , U 2 ~ N p (0 , Σ / 4) .
Solution: Given
Note:
X µ Σ 0
Z 2 p×1 = , E Z = , and Σ Z 2 p×2 p = .
Y δ 0 Ω t2X 2 tr X r
M X (t ) = E e tX = ∫ e tx f ( x) dx = E 1 + t X + +L+ + L
2 ! r !
Consider the transformation
U I I X t2 tr
U = 1 = , i.e. U = C 2 p×2 p Z , then = 1 + t E( X ) + E( X 2 ) + L + E ( X r ) + L
U 2 I − 1 I 2 p×2 p Y 2! r!
2 r
µ I I µ µ + δ t t
E U = C E Z = C = = = 1 + t µ1′ + µ 2′ + L + µ r ′ + L (2.11)
δ I − I δ µ − δ 2! r!
and Differentiating r times equation (2.11) with respect to t and then putting t = 0 , we get
I I Σ 0 I I Σ + Ω Σ − Ω ∂r t2 ∂r
ΣU = C Σ Z C ′ = = . M X (t ) = µ r′ + µ r′ +1 t + µ r′ + 2 + L ⇒ µr′ =
I − I 0 Ω I − I Σ − Ω Σ + Ω r
M X (t ) .
∂ t t =0 2! t =0 ∂tr t =0
Exercise: Let X 1 , X 2 , X 3 and X 4 be independent N p ( µ , Σ) random vectors, if
1 1 1 1 1 1 1 1
U 1 = X 1 − X 2 + X 3 − X 4 , and U 2 = X 1 + X 2 − X 3 − X 4 , find the joint
4 4 4 4 4 4 4 4
distribution of U 1 , and U 2 , and their marginals.
Solution: Consider the transformation
X1
1 1 1 1
U I − I I − I
X2
U = 1 = 4 4 4 4
X , i.e. U = C 2 p×4 p X 4 p×1 , then
U 2 1 I 1
I
1
− I
1
− I 3
4 4 4 4 2 p×4 p X
4
1 ∂ Fourth moment
= M µ n + ( t1 (t1σ 11 + t 2 σ 12 + L + t nσ 1n + L + t p σ 1 p )
2 ∂ tn
∂4M ∂ ∂ 3 M
+ L + t n (t1σ n1 + L + t nσ nn + L + t p σ np ) E[X n Xl X r X m] = =
∂ t n ∂ tl ∂ t r ∂ t m
t =0
∂ tm ∂ t n ∂ t l ∂ t r
t =0
+ L + t p (t1σ p1 + L + t nσ pn + L + t p σ pp ))}t =0
∂ p p p
p = M µ r + ∑ t j σ rj µ l + ∑ t j σ lj µ n + ∑ t j σ nj
1
= M µ n + (t1σ 1n + L + ∑ t jσ nj + t n σ nn + L + t p σ pn ) ∂ tm
2 j =1 j =1 j =1
j =1 t =0
p p p
1
p p p + M µ n + ∑ t j σ nj σ lr + M µ l + ∑ t j σ lj σ nr + M µ r + ∑ t j σ rj σ nl
= M µ n + ∑ t jσ jn + ∑ t j σ nj = M µ n + ∑ t j σ nj = µ n , as e 0 = 1
2 j =1 j =1 j =1 j =1 t =0
j =1
t = 0
j =1
t = 0
p p p p
= M µ m + ∑ t j σ mj µ r + ∑ t j σ rj µ l + ∑ t j σ lj µ n + ∑ t j σ nj
Second moment
∂
p j =1 j =1 j =1 j =1
∂2M ∂ ∂ M
E[X n Xl ] = = = M µ n + ∑ t j σ nj
∂ tl ∂ t n ∂ tl ∂ tn ∂ tl p p p
t =0 j =1
+ M σ rm µ l + ∑ t j σ lj µ + t σ +σ µ + t σ
t =0
n ∑ j nj lm r ∑ j rj
t =0
j =1 j =1 j =1
p p
= M µ l + ∑ t j σ lj µ n + ∑ t j σ nj + M σ nl = µ l µ n + σ nl .
p p p
j =1 j =1 µ + t σ + µ + t σ µ + t σ σ
n ∑ j nj r ∑ j rj l ∑ j lj nm
t =0
j =1 j =1 j =1
Therefore,
E ( X n2 ) = µ n2 + σ nn p p
+ M µ m + ∑ t j σ mj µ n + ∑ t j σ nj σ lr + M σ nm σ lr
and
V ( X n ) = E ( X n2 ) − [ E ( X n )]2 = µ n2 + σ nn − µ n2 = σ nn . j =1 j =1
Cov ( X n , X l ) = E ( X n X l ) − E ( X n ) E ( X l ) = µ n µ l + σ nl − µ n µ l = σ nl . p p
+ M µ m + ∑ t j σ mj µ l + ∑ t j σ lj σ nr + M σ lm σ nr
Third moment j =1 j =1
∂ 2 M p p
∂ 3M ∂ + M µ m + ∑ t j σ mj µ r + ∑ t j σ rj σ nl + M σ rm σ nl
E[X n Xl X r ] = =
∂ t n ∂ tl ∂ t r ∂ tr ∂ t n ∂ t l
t =0 t =0 j =1 j =1 t =0
= µ m µ r µ l µ n + µ l µ nσ rm + µ r µ nσ lm + µ r µ l σ nm + µ m µ nσ lr + σ nmσ lr p p p p
= M ∑ t j σ mj ∑ t j σ rj ∑ t j σ lj ∑ t j σ nj
+ µ m µ l σ nr + σ lmσ nr + µ m µ r σ nl + σ rmσ nl . j =1
j =1 j =1 j =1
∂3M ∂ ∂ 2 M
E [ X n − µ n ] [ X l − µl ] [ X r − µ r ] = =
∂ t n ∂ tl ∂ t r
t =0
∂ tr ∂ t n ∂ t l
t =0
p p
∂
= M ∑ t j σ lj ∑ t j σ nj + M σ nl
∂ tr j =1
j =1 t =0
p p p
= M ∑ t j σ rj ∑ t j σ lj ∑ t j σ nj
j =1 j =1 j =1
p p p
+ M σ lr ∑ t j σ nj + σ nr ∑ t j σ lj + M ∑ t j σ rj σ nl = 0.
j =1 j =1 j =1 t =0
∂4M
E [ X n − µ n ] [ X l − µl ] [ X r − µ r ] [ X m − µ m ] =
∂ t n ∂ tl ∂ t r ∂ t m
t =0
∂ p p p p
= M ∑ t j σ rj ∑ t j σ lj ∑ t j σ nj + M ∑ t j σ nj σ lr
∂ tm j =1
j =1 j =1 j =1
p p
+ M ∑ t j σ lj σ nr + M ∑ t j σ rj σ nl
j =1 j =1 t =0
individuals studied.
aij = ∑ ( xiα − xi ) ( x jα − x j ) , for all i and j
α =1
Characteristic Individual Mean n n n n
1 2 L α L n = ∑ [ xiα ( x jα − x j ) − xi ( x jα − x j )] = ∑ x iα x jα −xj ∑ x iα − x i ∑ x jα + n xi x j
α =1 α =1 α =1 α =1
X1 x11 x12 L x1α L x1n x1
n
X2 x 21 x 22 L x 2α L x 2n x2 = ∑ x iα x jα − n x i x j .
M M M M M M M M α =1
Xi xi1 xi 2 L x iα L xin xi
Some results
M M M M M M M M
1) Consider a quadratic form
Xp x p1 x p 2 L x pα L x pn xp
p
x1 x2 L xα L xn Q = x ′ A x = ∑ aij xi x j , where A is symmetric, then
i, j
2 4 7 / 2 − 2 7 6 1 1 1 1× 0
Example: Let A = , A = 14 − 12 = 2 , and A −1 = , A −1 = − = . −
3 7 − 3 / 2 1 2 2 2 2 ×1 2 ×1 2 ×1
1 1 −2
A=
A 7 A −4 A − 3 22 A22 2
a11 = 11 = , a12 = 21 = = −2 , a 21 = 12 = , a = = = 1. 3× 2 3× 2 3× 2
A 2 A 2 A 2 A 2 1 1 1
Also 3 3 3
1 1 1× 0
A11 1 A 21 2 A12 3/ 2 − 0 L 0
a11 = = = 2, a12 = = = 4, a 21 = = =3, 2 ×1 2 ×1 2 ×1
A −1 1/ 2 A −1 1/ 2 A −1 1/ 2
1 1 −2
0 L 0
A 22
7/2 3 ×2 3× 2 3× 2
a 22 = = =7. A= M M M M M M .
−1 1/ 2 1 1 1 − (n − 1)
A L
n (n − 1) n (n − 1) n (n − 1) n (n − 1)
5) A square matrix An × n is said to be orthogonal if A′A = AA′ = I n , and if the 1 1 1 1
L
transformation Y = A X transform X ′ X to Y ′ Y . Also n n n n
np n 1 n ′
log φ = − log (2π ) − log Σ − ∑ ( x α − µ )′ Σ −1 ( x α − µ ) . 1 1
Σ x = E ( x − µ ) ( x − µ )′ = E ( x1 + L + x n ) − µ ( x1 + L + x n ) − µ
2 2 2 α =1 n n
Differentiating with respect to µ and equating to zero 1
= E ( x1 + L + x n − n µ ) ( x1 + L + x n − n µ )′
n2
n n
∂ log φ 1
= 0 = −0 − 0 − ∑ 2 Σ −1 ( µ − x α ) or Σ −1 ∑ (µ − x α ) = 0 1
∂µ 2 α =1 = [ E ( x1 − µ ) ( x1 − µ )′ + L + E ( x n − µ ) ( x n − µ )′ + 0] , as xα ' s are independent.
α =1
n2
1 n 1
or µˆ = ∑x =x .
n α =1 α
= ( n Σ) = Σ / n .
n2
Thus,
Maximum likelihood estimate of variance covariance matrix
x ~ N ( µ , Σ / n) .
Let Σ −1 = (σ ij ) and Σ = (σ ij ) then the Σ −1 = σ i1Σ i1 + L + σ ip Σ ip , where Σ ij is the
A
Theorem: is an unbiased estimate of Σ .
ij −1 Σ ij th −1 −1 th n −1
cofactor of σ in Σ , therefore, = (i j ) element of (Σ ) = (i j ) element of
Σ −1 Proof: We have
Σ = σ ij . A = ∑ ( x α − x ) ( x α − x )′ = ∑ [( x α − µ ) − ( x − µ )][ ( xα − µ ) − ( x − µ )]′
α α
Now, the logarithm of the likelihood function is
= ∑ [( x α − µ ) ( x α − µ )′ − ( x − µ ) ( x α − µ )′ − ( x α − µ ) ( x − µ ) ′ + ( x − µ ) ( x − µ )′]
np n 1 n α
log φ = − log (2π ) + log Σ −1 − ∑ ( x α − µ )′ Σ −1 ( x α − µ )
2 2 2 α =1 = ∑ ( x α − µ ) ( x α − µ )′ − n ( x − µ ) ( x − µ ) ′ .
α
np n
=− log (2π ) + log (σ i1Σ i1 + L + σ ij Σ ij + L + σ ip Σ ip ) Taking expectation on both the sides
2 2
1 E ( A) = ∑ E ( x α − µ ) ( x α − µ )′ − n E ( x − µ ) ( x − µ ) ′ = n Σ − n Σ / n = (n − 1) Σ
− ∑ ∑ σ ij ( xiα − µi ) ( x jα − µ j ) , since x′ A x = ∑ aij xi x j
2 α i, j
α
i, j
A
or E = Σ , this shows that the maximum likelihood estimate of Σ is biased, i.e.
Differentiating with respect to σ ij and equating to zero, we get n −1
A n −1
∂ log φ n Σ ij 1 n ∂ f ′( x) E = Σ.
=0= − ∑ ( xiα − µ i ) ( x jα − µ j ) , because log f ( x ) = n n
ij 2 Σ −1 2 α =1 ∂x f ( x)
∂σ
1
Theorem: Given a random sample x1 , x 2 , L , x α , L , x n from N p ( µ , Σ) , x = ∑ x α and
n n n α
n 1 1
or σ ij = ∑ ( xiα − µ i ) ( x jα − µ j ) or σˆ ij = ∑ ( xiα − µˆ i ) ( x jα − µˆ j ) A = ∑ ( x α − x ) ( x α − x )′ . Thus x and A are independently distributed.
2 2 α =1 n α =1
α
A Proof: Make an orthogonal transformation
Hence, Σ̂ = .
n
y = c11 x1 + L + c1n x n
1
Theorem: Given x1 , x 2 , L , x α , L , x n be an independent random sample of size n(> p )
from N p ( µ , Σ) , then x ~ N ( µ , Σ / n) . M M M
y = c n −11 x1 + L + c n−1n x n
n −1
1 1
Proof: E ( x ) = E ∑ x α = E ( x + L + x ) = 1 (n µ ) = µ and
n n 1 n
n 1 1
α y = x1 + L + xn
n n n
y x1
n
1 = x1 x1′ + L + x n x n ′ = ∑ x i x i ′ .
or Y = C X , where Y = M , X = M , and i =1
x
y n
n Therefore,
c11 L c1n
c12 n n
M MM M
∑ y i y i′ − y n y n′ = ∑ x i x i ′ − y n y n′
i =1 i =1
C = c n −11 c n−12 L c n−1n is orthogonal.
n−1 n
1
1
L
1
or ∑ y i y i′ = ∑ xi xi′ − nx x ′ = A (3.2)
n n n i =1 i =1
Since C is orthogonal In view of equations (3.1) and (3.2), x and A depends on two mutually exclusive sets,
n
1 n which are independently distributed, therefore, x and A are independently distributed.
∑ cik n
= 0, i = 1, 2, L , n − 1 ⇒ ∑ cik = 0
k =1 k =1 Test for µ , when Σ is known
Also
Given a random sample x1 , x 2 , L , x α , L , x n from N p ( µ , Σ) . The hypothesis of interest is
n
0, i ≠ j
∑ cik × c jk = 1, i = j for i, j = 1, 2, L, n . H 0 : µ = µ , where µ is a specified vector, then, under H 0 , the test statistic is
0 0
k =1
n ( x − µ ) ′ Σ −1 ( x − µ ) ~ χ 2p .
Now consider, 0 0
p
Let χ 2p (α ) be the number such that Pr [ χ 2p ≥ χ 2p (α )] = α
∑ δ i2 = δ δ ′ = n (µ − µ 0 )′ Σ −1 ( µ − µ 0 ) = non-centrality parameter. Then for testing H 0 , we use the critical region
i =1
n1n2
The confidence region for µ is the set of possible values of µ satisfying ( x (1) − x ( 2) )′ Σ −1 ( x (1) − x (2) ) ≥ χ 2p (α ) .
n1 + n 2
n ( x − µ ) ′ Σ −1 ( x − µ ) ≤ χ 2p (α ) , this has confidence coefficient 1 − α .
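The test and confidence region above are easy to compute. The following Python sketch (assuming NumPy and SciPy, with simulated data and an arbitrary Σ taken as known) evaluates x̄, the statistic n(x̄ − µ_0)′Σ^{-1}(x̄ − µ_0) and the χ²_p critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
p, n, alpha = 3, 50, 0.05
mu_true = np.array([0.2, 0.0, -0.1])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])            # assumed known

x = rng.multivariate_normal(mu_true, Sigma, size=n)
xbar = x.mean(axis=0)                          # sample mean vector
mu0 = np.zeros(p)                              # H0: mu = mu0

stat = n * (xbar - mu0) @ np.linalg.inv(Sigma) @ (xbar - mu0)
crit = stats.chi2.ppf(1 - alpha, df=p)
print(stat, crit, stat >= crit)                # reject H0 when the statistic exceeds the critical value
```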
Note
Two sample problem i) If H 0 is not true, then
Given x1(1) , x (21) , L , x α(1) , L , x (n1) be a random sample from N p ( µ (1) , Σ) and E y = C −1 E ( x (1) − x (2) ) = C −1 ( µ (1) − µ ( 2) ) = δ (say), and Σ y = I
1
x1( 2) , x (22) , L , x α(2) , L , x (n2) from N p ( µ (2) , Σ) . H 0 : µ (1) = µ ( 2) , then, in this case, the So each y i ~ N (δ i , 1) , then
2
statistic is p
n1n 2 ( x (1) − x ( 2) )′ Σ *−1 ( x (1) − x (2) ) = ∑ yi2 ~ χ 2 ,
( x (1) − x (2) ) ′ Σ −1 ( x (1) − x (2) ) ~ χ 2p . i =1
p, ∑ δ i2
n1 + n 2 i
1 1 n1n 2
x (1) − x (2) ~ N p µ (1) − µ ( 2) , + Σ . = ( µ (1) − µ ( 2) ) ′ Σ −1 ( µ (1) − µ (2) )
n
1 n 2 n1 + n 2
ii) If the two populations have known covariance matrix Σ1 and Σ 2 , then 1 1
= exp − n ( x − µ ) ′ Σ −1 ( x − µ )
n/2 2
* −1 Σ Σ (2π ) np / 2 Σ
( x (1) − x ( 2) )′ Σ ( x (1) − x ( 2) ) ~ χ 2p , where Σ* = 1 + 2 .
n1 n2
1 1
× exp − tr Σ −1 A .
n/2 2
Result: (2π ) np / 2 Σ
n
1
If t is a sufficient statistic for θ if ∏ f ( xα ; θ ) = g (t ; θ ) h( x1 ,L , x n ) , where f ( x α ; θ ) is Thus, x and
n
A form a sufficient set of statistics for µ and Σ . If Σ is known, x is a
α =1
the density of the α − th observation, g (t ; θ ) is the density of t and h( x1 , L , x n ) does not 1
sufficient statistic for µ . However, if µ is known A is not a sufficient statistic for Σ , but
depend on θ . n
1
Sufficient Statistics for µ and Σ
∑ ( x − µ ) ( xα − µ ) is a sufficient statistic for Σ .
n α α
′
1 1 n
exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ ) .
np / 2 n/2
(2π ) Σ 2 α =1
Consider,
n n n
∑ ( xα − µ )′ Σ −1 ( x α − µ ) = tr ∑ ( xα − µ )′ Σ −1 ( x α − µ ) = tr ∑ Σ −1 ( xα − µ ) ′ ( xα − µ)
α =1 α =1 α =1
n
= trΣ −1 ∑ ( xα − µ ) ′ ( xα − µ ) .
α =1
We can write
n n
∑ ( xα − µ )′ ( x α − µ ) = ∑ [( xα − x ) + ( x − µ )] [( x α − x ) + ( x − µ )]′
α =1 α =1
n
= ∑ [( xα − x ) ( x α − x )′ + ( x α − x ) ( x − µ )′ +( x − µ ) ( x α − x )′ + ( x − µ ) ( x − µ )′]
α =1
n ′
= ∑ ( xα − x ) ( xα − x )′ + ∑ ( xα − x ) ( x − µ )′ +( x − µ ) ∑ ( xα − x )
α =1 α α
+ n ( x − µ ) ( x − µ )′
= A + n ( x − µ ) ( x − µ )′ , because ∑ ( xα − x ) = ∑ xα − n x = 0 .
α α
Thus the density of x1 , L , x n can be written as
1 1
exp − tr Σ −1 [ A + n ( x − µ )( x − µ )′]
np / 2 n/2 2
(2π ) Σ
1 1
= exp − [n ( x − µ ) ′ Σ −1 ( x − µ ) + tr Σ −1 A]
np / 2 n/2 2
(2π ) Σ
WISHART DISTRIBUTION M M M M M M M …
b p1 0 0 0 0 0 0 L b11
If x_1, x_2, ..., x_n are independent observations from N(µ, σ²), it is well known that
b p2 0 0 0 0 0 0 L 0 b22
n
(n − 1) s = ∑ ( xi − x ) ~ σ
2 2 2
χ n2−1 . The multivariate analogue of (n − 1) s 2 is the matrix A M M M M M M M L M M
i =1
b pp 0 0 0 0 0 0 L 0 0 L 2b pp
and is called the Wishart matrix. In other words, the Wishart matrix is defined as the p × p
symmetric matrix of sums of squares and cross products (of deviations about the mean) of the
p
sample observations, from a p − variate nonsingular normal distribution. The distribution of ∂A
∴
p p
= 2 p b11 p −( p −1)
b22 L b pp = 2 p ∏ biip −(i−1) .
A when the multivariate distribution is assumed normal is called Wishart distribution and is ∂B i =1
a generalization of χ 2 distribution in the univariate case.
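As a numerical illustration of this definition (a Python sketch assuming NumPy and SciPy; the parameter values are arbitrary), the matrix A of corrected sums of squares and cross products from an N_p(µ, Σ) sample has expectation (n − 1)Σ, in agreement with A ~ W_p(n − 1, Σ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p, n, reps = 2, 10, 5_000
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

A_mean = np.zeros((p, p))
for _ in range(reps):
    x = rng.multivariate_normal(mu, Sigma, size=n)
    d = x - x.mean(axis=0)
    A_mean += d.T @ d / reps       # A = sum over alpha of (x_a - xbar)(x_a - xbar)'

print(A_mean)                      # close to (n - 1) * Sigma
print((n - 1) * Sigma)

W = stats.wishart(df=n - 1, scale=Sigma).rvs(size=reps, random_state=7)
print(W.mean(axis=0))              # samples drawn directly from W_p(n-1, Sigma) give the same mean
```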
Theorem: Let x1 , x 2 , K , x n be a random sample from N p (µ , I ) ,
p ( p + 1)
By definition of A , we mean the joint distribution of the distinct elements aij , n n −1
2
(i, j = 1, 2, L , p ; i ≤ j ) of the symmetric matrix A .
A= ∑ ( xα − x ) ( xα − x ) ′ = ∑ Z α Z ′α , where Z α are independent, each distributed
α =1 α =1
n−1
Results:
1) Given a positive definite symmetric matrix A , there exists a nonsingular triangular
according to N p (0, I ) . Then the density of A = ∑ Z α Z ′α is
α =1
matrix B such that A = BB ′ .
(υ − p −1) / 2
p A exp (− 12 tr A)
∂A , where υ = n − 1 .
2) The Jacobian of transformation for the distinct element of B is = 2 p ∏ (bii ) p −(i −1) . p
∂B 2υ p / 2 π p ( p −1) / 4
i =1 ∏ Γ(υ − i + 1) / 2
Proof: The equation A = BB ′ can be written as i =1
Proof: Consider a nonsingular triangular matrix B such that
a11 a12 L a1 p b11 0 L 0 b11 b12 L b1 p
A = BB ′ (4.1)
a 21 a 22 L a 2 p b21 b22 L 0 0 b22 L b2 p
M =
M M M M M M M M M M M Let
a p1 a p 2 L a pp b p1 b p 2 0 L b pp
0 L 0 B 1
L b pp 0 ′
b11 b11 0 L 0
b2 L b21 b22 L 0 B ′ 2 b21 b22 L 0
11 b11b21 b11b p1 B = = , and Brr = , so that
b b M M M M M M M M M
2 2
b21 + b22 L b21b p1 + b22 b p 2
= 21 11 b p1 b p 2 L b pp ′ b
M M M M B p r1 br 2 L brr
2
b p1b11 b p1b21 + b p 2 b22 L b 2p1 + b 2p 2 + L + b pp
B pp = B .
∂A ∂ (a11 , a 21 a 22 , a31 a32 a33 , L , a p1 a p 2 L a pp ) Let f (B) = joint density function of nonzero elements of B , then we can write
=
∂B ∂ (b11 , b21 b22 , L , b p1 b p 2 L b pp )
p −1
f ( B ) = f ( B ′1 ) f ( B ′ 2 B11 ) L f ( B ′ p B p−1 p −1 ) = ∏ f ( B ′ i +1 Bii ) (4.2)
∂ a11 a 21 a 22 a31 a 32 a33 … a p1 a p 2 L a pp
i =0
b11 2b11 … Let
b21 0 b11 … B ′. r = (br1 br 2 L br r −1 ) , so that
b22 0 0 2b22 …
b31 0 0 0 b11 … B ′ r = ( B ′. r brr 0 L 0)
b32 0 0 0 0 b22 … The equation (4.1) can be explained as
b33 0 0 0 0 0 2b33 …
a11 L a1i +1 L a1 p b11 L 0 L 0 b11 L bi +11 L b p1 = (bi2+11 + bi2+12 + L + bi2+1i+1 ) − (bi2+11 + bi2+12 + L + bi2+1i ) ,since A = BB ′
M M M M M M M M M M M M M M M
a L a L a = b L b L 0 0 L bi+1i L b pi = bi2+1i +1 ~ χυ2−i (4.8)
i1 ii +1 ip i1 ii
M M M M M M M M M M M M M M M
a L b pp 0 L L b pp As reg SS and error SS are independent, so that bi +1i+1 is independent of B . i+1 .
p1 L a pi +1 L a pp b p1 L b pi 0
Let Therefore,
B .i +1 = Bii′ C , from equation (4.5), since B′ii is fixed with respect to Z i +1 , then p −1
because ∏ π i / 2 = π [1+2+L+( p−1)] / 2 = π p( p−1) / 4 .
E B . i +1 = Bii′ E C = 0 , Σ B.i +1 = Bii′ Aii−1 Bii = Bii′ ( Bii Bii′ ) −1
Bii = I . i =0
p
Further, B . i +1 being a linear combination of regression coefficients which are themselves
∏ (bii )υ −i 1 p i
normally distributed, we conclude that f ( B) = i =1
exp − ∑ ∑ b 2j i
p
2 i =1 j =1
B . i +1 ~ N i (0, I ) (4.6) 2 p (υi −2) / 2 π p ( p−1) / 4 ∏ Γ(υ − i + 1) / 2
i =1
Re g SS = C ′ a .i +1 , because if the normal equation Q = S β , then Re g SS is β ′ Q
Now
= [( Bii′ ) −1
B .i +1 ]′ Bii B . i+1 = B ′. i +1 ( Bii ) −1 Bii B .i +1 = B ′. i +1 B . i+1 ∂B 1 1
= = .
∂A ∂A p
p p −i +1
= bi2+11 + bi2+12 + L + bi2+1i ~ χ i2 (4.7) ∂B 2 ∏ (bii )
i =1
Error SS = Total SS − Re g SS
p p i
= ai +1i +1 − (bi2+11 + bi2+12 + L + bi2+1i ) Also, tr A = the sum of the diagonal elements of A = ∑ Aii = ∑ ∑ b 2j i .
i =1 i =1 j =1
Therefore, and
p (υ − p −1) / 2
∏ (bii )υ − p−1 A*
* 1
f ( B) i =1 1 f (A ) = exp − tr A*
f ( A) = = exp − tr A p 2
p p 2 2υ p / 2 π p( p −1) / 4 ∏ Γ(υ − i + 1) / 2
2 p ∏ (bii ) p −i +1 2υ p / 2 π p ( p −1) / 4 ∏ Γ(υ − i + 1) / 2 i =1
i =1 i =1
*
∂A
Consider, f ( A) = f [ A* ( A)] .
∂A
p
υ − p −1
∏ (bii )υ − p−1 = (b11 b22 L b pp )υ − p−1 = B Now
i =1
∂A* ∂ BAB p +1
p = = B , and tr A* = tr BAB ′ = tr AB ′B = tr AΣ −1
2 1/ 2 (υ − p −1) / 2 ∂A ∂A
A = B B′ = B ⇒ B = A ⇒ ∏ (bii )υ − p−1 = A
2
i =1 From BΣB ′ = I , we have 1 = BΣB ′ = B Σ B ′ = Σ BB ′ = Σ B
(υ − p −1) / 2
A 1 and
f ( A) = exp − tr A .
p 2 1 1 A
2υp / 2 π p ( p −1) / 4 ∏ Γ(υ − i + 1) / 2 BB ′ = , B = and A* = BAB ′ = .
Σ 1/ 2 Σ
i =1 Σ
This is the form of wishart distribution when Σ = I and is denoted by W p (υ , I ) . Therefore,
(υ − p −1) / 2
Theorem: Suppose the p − component vectors x1 , x 2 ,L , x n (n > p) are independent, A exp (− 12 tr A Σ −1 ) 1
each distributed according to N p ( µ , Σ) , then the density of f ( A) =
p ( p +1) / 2
(υ − p −1) / 2 Σ
n n −1 2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2
A= ∑ ( xα − x ) ( x α − x ) ′ = ∑ Z α Z ′α , where, Z α ~ N p (0, Σ) is i =1
α =1 α =1 (υ − p−1) / 2
A exp (− 12 tr A Σ −1 )
A
(υ − p −1) / 2
exp (− 1 tr A Σ −1 ) = , ⇒ A ~ W p (υ , Σ) .
2 p
, where υ = n − 1 . υ/2
p 2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2
υ/2
2υp / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2 i =1
i =1 Theorem: Let x1 , x 2 ,L , x n (n ≥ p + 1) are distributed independently, each according to
Proof: Since Σ is symmetric and positive definite, there exists a nonsingular triangular
1 n
matrix B such that N p ( µ , Σ) . Then the distribution of S = ∑ ( x − x ) ( xα − x )′ is W p (υ , Σ / υ ) , where
n − 1 α =1 α
BΣB ′ = I ⇒ B ′B = Σ −1 . υ = n −1.
Make the transformation Proof: Clearly,
Z *α = B Z α , α = 1, 2,L , n − 1 . A n
1 1
′
1
S= = ∑ Z Zα , where Z α ~ N p (0, Σ / n − 1) .
n − 1 α =1 n − 1 α
*
E (Z α ) = B E (Z α ) = 0 , Σ Z * = B E ( Z α Z ′α ) B ′ = B Σ B ′ = I . n −1 n −1
α
Since Σ is symmetric and positive definite, there exists a nonsingular triangular matrix B such
Thus, that
n−1 ′ n −1 −1
= B ∑ Z α Z ′α B ′ = B A B ′ = A* BΣ * B ′ = I , where Σ* = Σ / n − 1 and B ′B = Σ* .
Z *α ~ N p (0, I ) , and ∑ Z α* Z α*
α =1 α =1 Make the transformation
⇒ A* ~ W p (υ , I ) 1
Z *α = B Zα , α = 1, 2,L , n − 1 .
n −1
Then

E( Z*_α ) = 0 ,  Σ_{Z*_α} = (1/(n − 1)) B E( Z_α Z′_α ) B′ = B Σ* B′ = I .

Thus Z*_α ~ N_p(0, I) , and

∑_{α=1}^{n−1} Z*_α Z*′_α = B [ (1/(n − 1)) ∑_{α=1}^{n−1} Z_α Z′_α ] B′ = B S B′ = S* .

Hence S* ~ W_p(υ, I) , and

f ( S* ) = |S*|^{(υ−p−1)/2} exp( − ½ tr S* ) / [ 2^{υp/2} π^{p(p−1)/4} ∏_{i=1}^{p} Γ((υ − i + 1)/2) ] .

Consider f ( S ) = f [ S*( S ) ] | ∂S*/∂S | . Now

∂S*/∂S = ∂( B S B′ )/∂S = |B|^{p+1} ,  tr S* = tr B S B′ = tr S B′ B = tr S [ Σ/(n − 1) ]^{−1} .

From B Σ* B′ = I , we have 1 = |B Σ* B′| = |B| |Σ*| |B′| = |Σ*| |B|² , thus

|B|² = 1/|Σ*| ,  |B| = 1/|Σ*|^{1/2} ,  and  |S*| = |B S B′| = |S| / |Σ*| = |S| / |Σ/(n − 1)| .

Therefore,

f ( S ) = |S|^{(υ−p−1)/2} exp{ − ½ tr S [ Σ/(n − 1) ]^{−1} } / [ 2^{υp/2} π^{p(p−1)/4} |Σ/(n − 1)|^{υ/2} ∏_{i=1}^{p} Γ((υ − i + 1)/2) ] .

This is the density of W_p(υ, Σ/υ) .

Characteristic function

Consider

A = [ A_11  A_12 ; A_21  A_22 ] ;  the elements of the matrix (taking account of A_12 = A_21 ) are A_11 , A_22 , 2 A_12 .

Introduce a real symmetric matrix Θ = (θ_ij) , with θ_ij = θ_ji ,

Θ = [ θ_11  θ_12 ; θ_21  θ_22 ] ,  then  A Θ = [ A_11 θ_11 + A_12 θ_21   A_11 θ_12 + A_12 θ_22 ; A_21 θ_11 + A_22 θ_21   A_21 θ_12 + A_22 θ_22 ] ,

tr A Θ = A_11 θ_11 + A_12 θ_21 + A_21 θ_12 + A_22 θ_22 = A_11 θ_11 + A_22 θ_22 + 2 A_12 θ_12 .      (4.9)

We know that the characteristic function of a vector x′ = ( x_1 , x_2 , …, x_p ) is defined as

E e^{i t′ x} ,  where  t′ x = t_1 x_1 + t_2 x_2 + … + t_p x_p .      (4.10)

In view of (4.10), we can write (4.9) as E( e^{i tr A Θ} ) .

Theorem: If Z_1 , Z_2 , …, Z_{n−1} are independent, each with distribution N_p(0, Σ) , the characteristic function of A_11 , A_22 , …, A_pp , 2A_12 , …, 2A_{p−1 p} , where A = (A_ij) = ∑_{α=1}^{n−1} Z_α Z′_α , is given by

φ_A(Θ) = E e^{i tr A Θ} = |I − 2 i Θ Σ|^{−υ/2} ,  where n − 1 = υ .

Proof: We have

φ_A(Θ) = E exp{ i tr ∑_{α=1}^{υ} Z_α Z′_α Θ } = E exp{ i ∑_{α=1}^{υ} Z′_α Θ Z_α } ,  by virtue of the fact that tr E F G = ∑ e_ij f_jk g_ki = tr F G E ,

= ∏_{α=1}^{υ} E e^{ i Z′_α Θ Z_α } ,  as the Z_α ' s are identically distributed random vectors,

= [ E ( e^{ i Z′ Θ Z } ) ]^{υ} ,  where Z ~ N_p(0, Σ) .      (4.11)

Since Θ is real and Σ is positive definite, there exists a real nonsingular matrix C such that

C′ Σ^{−1} C = I ,  and  C′ Θ C = D ,  where D is a real diagonal matrix with diagonal elements d_jj .

Consider the transformation Z = C Y ; then E Y = 0 , and Σ_Y = C^{−1} E( Z Z′ ) ( C^{−1} )′ = ( C′ Σ^{−1} C )^{−1} = I

⇒ Y ~ N_p(0, I) , i.e. Y_j , j = 1, 2, …, p , are independently distributed as N(0, 1) . Therefore,

E e^{ i Z′ Θ Z } = E e^{ i Y′ C′ Θ C Y } = E e^{ i Y′ D Y } = E exp{ i ∑_{j=1}^{p} d_jj y_j² } = ∏_{j=1}^{p} E e^{ i d_jj y_j² } = ∏_{j=1}^{p} ( ch. function of χ²_1 ) = ∏_{j=1}^{p} ( 1 − 2 i d_jj )^{−1/2}

= |I − 2 i D|^{−1/2} ,  as ( I − 2 i D ) is a diagonal matrix.      (4.12)

Moreover,

|I − 2 i D| = |C′ Σ^{−1} C − 2 i C′ Θ C| = |C′ ( Σ^{−1} − 2 i Θ ) C| = |C|² |Σ^{−1} − 2 i Θ| .
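The theorem can also be checked numerically: we may estimate E e^{i tr AΘ} by simulation and compare it with |I − 2iΘΣ|^{−υ/2} . A minimal sketch, assuming Python with numpy (the small symmetric Θ and the matrix Σ below are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    p, nu, reps = 2, 8, 200_000
    Sigma = np.array([[1.0, 0.4],
                      [0.4, 2.0]])
    Theta = np.array([[0.05, 0.02],
                      [0.02, 0.03]])        # small symmetric Theta (arbitrary choice)

    # Monte Carlo estimate of E exp(i tr(A Theta)), A = sum of Z Z', Z ~ N_p(0, Sigma)
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, nu))
    A = np.einsum('rni,rnj->rij', Z, Z)     # one Wishart draw per replication
    mc = np.mean(np.exp(1j * np.trace(A @ Theta, axis1=1, axis2=2)))

    theory = np.linalg.det(np.eye(p) - 2j * Theta @ Sigma) ** (-nu / 2)
    print(mc, theory)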
Also, from C′ Σ^{−1} C = I we get |C′ Σ^{−1} C| = 1 , or |C′| |Σ^{−1}| |C| = 1 , or |C|² |Σ^{−1}| = 1 , i.e. |C|² = |Σ| . Hence

|I − 2 i D| = |Σ| |Σ^{−1} − 2 i Θ| = |I − 2 i Θ Σ| ,

and combining (4.11) and (4.12),

φ_A(Θ) = |I − 2 i Θ Σ|^{−υ/2} ,  as asserted.

In particular, if A_1 , A_2 , …, A_q are independent with A_j ~ W_p(υ_j, Σ) , each characteristic function is of this form, e.g. φ_{A_2}(Θ) = |I − 2 i Θ Σ|^{−υ_2/2} . Therefore,

A = ∑_{j=1}^{q} A_j ~ W_p( ∑_{j=1}^{q} υ_j , Σ ) .

Theorem: If A ~ W_p(n − 1, Σ) , then the distribution of l′ A l is ( l′ Σ l ) χ²_{n−1} , where l is a fixed vector.

Theorem: If A ~ W_p(n − 1, Σ) , and if a is a positive constant, then a A ~ W_p(n − 1, a Σ) .

Proof: Since A ~ W_p(n − 1, Σ) , there are n − 1 independent p − component random vectors Z_1 , …, Z_{n−1} , each distributed as N_p(0, Σ) , such that A = ∑_{α=1}^{n−1} Z_α Z′_α . Evidently,

a A = ∑_{α=1}^{n−1} ( √a Z_α )( √a Z_α )′ ,  and  √a Z_1 , …, √a Z_{n−1} are independently identically distributed as N_p(0, a Σ) . Thus a A ~ W_p(n − 1, a Σ) .

Theorem: Let A and Σ be partitioned into q and p − q rows and columns as

A = [ A_11  A_12 ; A_21  A_22 ] ,  Σ = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ] .

If A is distributed according to W_p(υ, Σ) , then A_11 is distributed according to W_q(υ, Σ_11) (the marginal distribution of some sets of elements of A ).

Proof: Given A ~ W_p(υ, Σ) , where A = ∑_{α=1}^{n−1} Z_α Z′_α and the Z_α are independent, each with distribution N_p(0, Σ) . Partition Z_α into sub-vectors of q and p − q components as

Z_α = [ Z^(1)_α ; Z^(2)_α ] .

Then the Z^(1)_α are independent, each with distribution N_q(0, Σ_11) , and

A_11 = ∑_{α=1}^{n−1} Z^(1)_α Z^(1)′_α  ⇒  A_11 ~ W_q(υ, Σ_11) .

Theorem: If Σ_ij = 0 for i ≠ j and if A is distributed according to W_p(υ, Σ) , then A_11 , A_22 , …, A_qq are independently distributed and A_jj is distributed according to W(υ, Σ_jj) .

Proof: Let A = ∑_{α=1}^{n−1} Z_α Z′_α , where the Z_α are independent, each with distribution N_p(0, Σ) . Partition Z_α as

Z_α = [ Z^(1)_α ; … ; Z^(q)_α ] .

Since Σ_ij = 0 for i ≠ j , Z^(i)_α and Z^(j)_α are independent; thus Z^(1)_α , Z^(2)_α , …, Z^(q)_α are independent, with Z^(i)_α ~ N(0, Σ_ii) , and

A_ii = ∑_{α=1}^{n−1} Z^(i)_α Z^(i)′_α  is independent of  A_jj = ∑_{α=1}^{n−1} Z^(j)_α Z^(j)′_α .

Hence A_ii = ∑_{α=1}^{n−1} Z^(i)_α Z^(i)′_α , i = 1, 2, …, q , are independently distributed.

Theorem: Let A and Σ be partitioned into q and p − q rows and columns as

A = [ A_11  A_12 ; A_21  A_22 ] ,  Σ = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ] .

If A is distributed according to W_p(υ, Σ) , then A_11 − A_12 A_22^{−1} A_21 is distributed according to W_q( υ − (p − q) , Σ_11 − Σ_12 Σ_22^{−1} Σ_21 ) (the conditional distribution of some sets of elements of A ).
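The result that l′ A l behaves like ( l′ Σ l ) χ²_{n−1} is easy to verify by simulation. A small sketch, assuming Python with numpy and scipy (the vector l , the matrix Σ and the degrees of freedom below are arbitrary illustrative choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    p, nu, reps = 3, 10, 5000
    Sigma = np.array([[1.0, 0.3, 0.0],
                      [0.3, 2.0, 0.5],
                      [0.0, 0.5, 1.5]])
    l = np.array([1.0, -2.0, 0.5])          # arbitrary fixed vector

    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, nu))
    A = np.einsum('rni,rnj->rij', Z, Z)     # A ~ W_p(nu, Sigma), one draw per replication
    q = np.einsum('i,rij,j->r', l, A, l) / (l @ Sigma @ l)

    # q should behave like a chi-square with nu degrees of freedom
    print(stats.kstest(q, 'chi2', args=(nu,)))
    print(q.mean(), "~", nu, "   ", q.var(), "~", 2 * nu)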
Thus, b_{11.2,…,p} has a χ² − distribution with (n − p) degrees of freedom.

Since the Z*_α ' s are independently distributed,

⇒ b_{11.2,…,p} , b_{22.3,…,p} , …, b_{p−1 p−1.p} , b_pp are independently distributed,

⇒ b_{ii.i+1,…,p} has a χ² − distribution with n − 1 − (p − i) degrees of freedom.

Therefore,

|B| is distributed as χ²_{n−1−(p−1)} ⋅ χ²_{n−1−(p−2)} ⋅ … ⋅ χ²_{n−2} ⋅ χ²_{n−1} .

Since S = A/(n − 1) , so that |S| = |A| / (n − 1)^p = |Σ| |B| / (n − 1)^p ,

|S| is distributed as [ |Σ| / (n − 1)^p ] χ²_{n−1} ⋅ χ²_{n−2} ⋅ … ⋅ χ²_{n−1−(p−2)} ⋅ χ²_{n−p} .

For p = 1 ,

|S| = [ σ_11 / (n − 1) ] χ²_{n−1}  ⇒  (n − 1) s_11 / σ_11 ~ χ²_{n−1}  or  a_11 / σ_11 ~ χ²_{n−1} .

Moments of sample generalized variance

Since |S| is distributed as [ |Σ| / (n − 1)^p ] ( y_1 y_2 … y_p ) , where y_i ~ χ²_{n−i} , i = 1, 2, …, p , are independent, thus

E |S|^h = [ |Σ|^h / (n − 1)^{ph} ] E( y_1 )^h E( y_2 )^h … E( y_p )^h .

Consider y ~ χ²_υ , where υ = n − 1 ; then

E( y^h ) = ∫ y^h f(y) dy = [ 1 / ( 2^{υ/2} Γ(υ/2) ) ] ∫_0^∞ y^{(υ/2)+h−1} e^{−y/2} dy

= [ 1 / ( 2^{υ/2} Γ(υ/2) ) ] Γ[ (υ/2) + h ] / (1/2)^{(υ/2)+h} ,  because  ∫_0^∞ x^{m−1} e^{−λ x} dx = Γ(m) / λ^m ,

= 2^h Γ[ (υ/2) + h ] / Γ(υ/2) .

Therefore,

E |S|^h = [ 2^{ph} |Σ|^h / (n − 1)^{ph} ] ∏_{i=1}^{p} Γ[ (n − i)/2 + h ] / Γ[ (n − i)/2 ] .

For h = 1 ,

E |S| = [ |Σ| / (n − 1)^p ] ∏_{i=1}^{p} 2 Γ[ (n − i)/2 + 1 ] / Γ[ (n − i)/2 ] = [ |Σ| / (n − 1)^p ] ∏_{i=1}^{p} (n − i) .

For h = 2 ,

E |S|² = [ |Σ|² / (n − 1)^{2p} ] ∏_{i=1}^{p} 2² [ (n − i)/2 + 1 ] [ (n − i)/2 ] = [ |Σ|² / (n − 1)^{2p} ] ∏_{i=1}^{p} (n − i + 2)(n − i) ,

so that

V( |S| ) = E |S|² − ( E |S| )² = [ |Σ|² / (n − 1)^{2p} ] { ∏_{i=1}^{p} (n − i + 2)(n − i) − ∏_{i=1}^{p} (n − i)² } .

Note: For p = 2 ,

E |S|^h = [ |Σ|^h 2^{2h} / (n − 1)^{2h} ] Γ[ (n + 2h − 1)/2 ] Γ[ (n + 2h − 2)/2 ] / { Γ[ (n − 1)/2 ] Γ[ (n − 2)/2 ] } .

Using Legendre's duplication formula  Γ( α + ½ ) Γ( α ) = √π Γ(2α) / 2^{2α−1} ,

E |S|^h = [ |Σ|^h 2^{2h} / (n − 1)^{2h} ] ⋅ [ √π Γ(n + 2h − 2) / 2^{n+2h−3} ] ⋅ [ 2^{n−3} / ( √π Γ(n − 2) ) ] = |Σ|^h Γ(n + 2h − 2) / [ (n − 1)^{2h} Γ(n − 2) ] .      (4.14)
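The formula E|S| = |Σ| ∏_{i=1}^{p}(n − i) / (n − 1)^p obtained above for h = 1 can be checked by simulation. A minimal sketch, assuming Python with numpy ( p , n and Σ are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)
    p, n, reps = 3, 15, 20000
    Sigma = np.array([[1.0, 0.2, 0.1],
                      [0.2, 1.5, 0.3],
                      [0.1, 0.3, 2.0]])

    dets = np.empty(reps)
    for r in range(reps):
        x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        S = np.cov(x, rowvar=False)         # divisor n-1, as in the notes
        dets[r] = np.linalg.det(S)

    theory = np.linalg.det(Sigma) * np.prod(n - np.arange(1, p + 1)) / (n - 1) ** p
    print(dets.mean(), theory)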
σ 11 0 L 0 (υ −1) / 2 1 a
a
0 σ 22 L 0 p rij
(υ − p −1) / 2
p ii
exp −
2
∑ σ ii
i ii
Σ =
M M M M
= ∏ σ ii . f ( A) =
p ∏ υ/2 υ/2
2 (σ ii )
0 0 L σ pp
i =1
π p ( p −1) / 4
∏ Γ(υ − i + 1) / 2 i =1
i =1
(υ −1) / 2 1 a ii (υ + 2 h) / 2
p
rij
(υ − p −1) / 2
p
aii exp − = 2 (υ + 2h) p / 2 π p ( p −1) / 4 Σ ∏ Γ [(υ + 2h) − i + 1] / 2 .
∞ 2 σ ii
= ∏ ∫0 daii . i =1
p ( p −1) / 4
p
i =1 2υ / 2 (σ ii )υ / 2 Therefore,
π ∏ Γ(υ − i + 1) / 2
i =1 p
(υ + 2h) / 2
Consider 2 (υ + 2h) p / 2 π p ( p −1) / 4 Σ ∏ Γ [(υ + 2h) − i + 1] / 2
h i =1
E A =
1 aii p
aii(υ −1) / 2 exp − 2υp / 2 π p ( p −1) / 4 Σ
υ/2
∞ 2 σ ii da . Put aii = u , or d a = 2 σ du , then ∏ Γ(υ − i + 1) / 2
B=∫ ii i ii ii i i =1
0 2υ / 2 (σ ii )υ / 2 2 σ ii
p
h
∞ 2 ph / 2 Σ ∏ Γ (υ + 2h − i + 1) / 2
B = ∫ (ui ) (υ / 2) −1 exp(−ui ) du i = Γ(υ / 2), ∀ i. i =1
0 =
p
Hence, the density function of rij is ∏ Γ(υ − i + 1) / 2
i =1
p
(υ − p −1) / 2
rij ∏ Γ (υ / 2) (υ − p −1) / 2
[ Γ (υ / 2)] p 2 ph / 2 Σ
h
p
n−i
f (rij ) = i =1
=
rij
.
∏ Γ 2
+ h
p p i =1
p ( p −1) / 4 p ( p −1) / 4
= .
p
π ∏ Γ (υ − i + 1) / 2 π ∏ Γ (υ − i + 1) / 2
i =1 i =1 ∏ Γ (n − i ) / 2
i =1
h
Exercise: Find the E A directly from W (υ , Σ) .
Result:
Solution: We know that, if A ~ W (υ , Σ) , then
Let us suppose that x1 ,L, xn constitute a random sample of size n from a population whose
(υ − p −1) / 2 density at x is f ( x ; θ ) and Ω is the set of values, which can be taken on by the parameter θ
A 1
f ( A) = exp − tr AΣ −1 . (Ω is the parameter space for θ ) and the parameter space for θ is partitioned into the
p 2
υ p/2 p ( p −1) / 4 υ/2
2 π Σ ∏ Γ{(υ − i + 1) / 2} disjoint sets ω and ω ′ , according to the null hypothesis
i =1 H 0 : θ ∈ ω , against H A : θ ∈ ω ′ , if
Now
max L0
λ= is referred to as a value of the likelihood ratio statistic λ ,
∫A f ( A) dA = 1 max L
p where max L0 and max L are the maximum values of the likelihood function for all values
(υ − p −1) / 2 1 υ /2
⇒ ∫A A exp − tr AΣ −1 dA = 2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ{(υ − i + 1) / 2} of θ in ω and Ω respectively. Since max L0 and max L are both values of a likelihood
2 i =1 function, and therefore, never negative, it follows that λ ≥ 0 , also, since ω is a subset of the
(4.17) Ω , it follows that λ ≤ 1 , then the critical region of the form λ ≤ k , (0 < k < 1) , defines a
h h likelihood ratio test of the null hypothesis H 0 : θ ∈ ω against the alternative hypothesis
E A =∫ A f ( A) dA
A H A : θ ∈ ω ′ . Such that
(υ − p −1+ 2h ) / 2 Pr [λ ≤ k H0] = α .
A 1
= exp − tr AΣ −1 dA
p 2 Example: Let H 0 : µ = µ 0 , against H A : µ ≠ µ0
υ/2
2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ{(υ − i + 1) / 2}
i =1 On the basis of a random sample of size n from normal population with the known variance
Now using equation (4.17), we have σ 2 , the critical region of the likelihood ratio test is obtained as follows:
Since ω contains only µ 0 , it follows that µ = µ 0 , and Ω is the set of all real numbers, it
[(υ + 2h)− p −1] / 2 1
∫A A exp − tr AΣ −1 dA follows that µ̂ = x . Thus
2
n distribution of the likelihood ratio statistic λ , itself. Since the distribution of λ is generally
1 1 n
2 ∑ i
max L0 = exp − ( x − µ0 )2 quite complicated, which makes it difficult to evaluate k , it is often preferable to use the
σ 2π 2 σ i=1 following approximation. For large n , the distribution of − 2 ln λ approaches, under very
1
n 1 n general conditions the chi-square distribution with 1 degree of freedom, i.e. − 2 ln λ ~ χ12 . In
2 ∑ i
and max L = exp − ( x − x ) 2 , then the value of the likelihood ratio above example, we find that
σ 2π 2 σ i =1
statistic becomes, 2
n x − µ0
− 2 ln λ = ( x − µ 0 ) 2 = .
1 n σ 2
σ / n
2 ∑ i
exp − (x − µ0 ) 2
2 σ i =1 n
λ= = exp − ( x − µ0 ) 2 Testing independence of sets of variates
1 n 2 σ
2
exp − ∑ ( xi − x ) 2 Let the p − component vector X be distributed according to N p ( µ , Σ) . We partition X into
2 σ 2 i=1
q sub vectors with p1 , p 2 , L , p q components respectively, that is
Hence, the critical region of the likelihood ratio test is
X (1)
n
exp − ( x − µ 0 ) 2 ≤ k , since λ ≤ k X = M , p1 + p2 + L + p q = p . The vector of mean µ and the covariance matrix
2
2 σ (q)
X
n
and after taking logarithms and dividing by , it becomes Σ are partitioned similarly,
2σ 2
µ (1) Σ11 L Σ1q
2 2σ 2
(x − µ0 ) ≥ − ln k , ln k is negative in view of 0 < k < 1 µ = M , Σ = M M M .
n (q )
µ Σ q1 L Σ qq
or x − µ0 ≥ K ,
where K will have to be determined so that the size of the critical region is α i.e. H 0 : X (1) , X ( 2) ,L , X ( q ) are mutually independently distributed or equivalently Σ ij = 0 ,
Pr [ x − µ 0 ≥ K ] = α . (a) for i ≠ j , where Σ ij = E ( X (i) − µ (i) ) ( X ( j ) − µ ( j ) )′ .
Since x has a normal distribution with mean µ 0 and variance σ 2 / n , i.e. Under H 0 , Σ is of the form
x − µ0
x ~ N ( µ 0 , σ 2 / n) , then Z = ~ N (0, 1) . Σ11 L 0
σ/ n
Σ= M M M = Σ 0 (say)
For the given value of α , we find a value Z α (say) of standard normal variate by the 0 L Σ qq
following equation
σ Let x1 , x 2 ,L, x n be n independent observations on X , the likelihood ratio criterion is
Pr [ Z ≥ Z α / 2 ] = α or Pr x − µ 0 ≥ Zα / 2 = α (b)
n max L( µ , Σ 0 ) max L0
λ= = , (4.18)
From equation (a) and (b), we get max L( µ , Σ) max L
σ where
K= Zα / 2 .
n
1 1
Therefore, the critical region of the likelihood ratio test is L ( µ , Σ) = exp − ∑ ( x α − µ )′ Σ −1 ( x α − µ ) ,
np / 2 n/2 2
(2π ) Σ α
σ
Pr x − µ 0 ≥ Zα / 2 = α .
n and the numerator of (4.18) is the maximum of L for µ , Σ ∈ ω restricted by H 0 and the
Note: It was easy to find the constant that made the size of the critical region equal to α , denominator is the maximum of L over the entire parametric space Ω . Now consider the
because we were able to refer to the known distribution of x , and did not have to derive the likelihood function over the entire parametric space
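For the example above, the identity −2 ln λ = n ( x̄ − μ_0 )² / σ² shows that the likelihood ratio test with known σ² is just the usual z-test. A small numerical sketch, assuming Python with numpy/scipy (the data below are simulated purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, mu0, sigma = 40, 5.0, 2.0
    x = rng.normal(5.6, sigma, size=n)      # data generated away from mu0 (illustrative)

    xbar = x.mean()
    lam = np.exp(-n * (xbar - mu0) ** 2 / (2 * sigma ** 2))   # likelihood ratio
    minus2loglam = -2 * np.log(lam)

    z = (xbar - mu0) / (sigma / np.sqrt(n))
    print(minus2loglam, z ** 2)             # identical, by the algebra above
    print("p-value:", stats.chi2.sf(minus2loglam, df=1))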
n np / 2 1 A −1 q
= exp − tr ∑ ( x α − x ) ( x α − x )′ p 3 − ∑ pi3
(2π ) np / 2 A
n/2 2 n α p ( p + 1) p ( p + 1) 1 2
q 3 q
f = −∑ i i = p − ∑ pi2 , and m = n − − i =1
.
2 2 2 2 2 q 2
3 p − ∑ pi
i =1 i =1
n np / 2 1
= exp − tr n I
(2π ) np / 2
A
n/2 2 i =1
Exercise: Show that the likelihood ratio criterion λ for testing independence of sets of
n np / 2 1 vectors can be written as
= exp − np , where trace nI p× p = np
(2π ) np / 2 A
n/2 2
n/2 R11 L R1q
R
Similarly, under H 0 , the likelihood function is λ= , where R = (r jk ) = M M M .
q
n/2 Rq1 L Rqq
1 1 ∏ Rii
max Lω = exp − ∑ ( x α − µˆ )′ Σˆ 0−1 ( x α − µˆ ) i =1
n/2 2
(2π ) np / 2 Σˆ 0 α n/2
A
Solution: We have, λ = =V n/2 .
q
1 1 q
n/2
=∏ exp − ∑ ( x α(i) − µˆ (i ) ) ′ Σˆ ii−1 ( x α(i) − µˆ (i ) ) ∏ Aii
n / 2
i =1 ( 2π ) npi / 2 Σ
ˆ ii 2 α i =1
Define,
q
1 1
=∏ exp − npi a jk
n/ 2 2 r jk = , ⇒ a jk = r jk a jj a kk
i =1 A
(2π ) npi / 2 ii a jj a kk
n
p p1 + L+ pi −1 + pi
1 1 ⇒
= q
exp − np . A = R ∏ a jj , where p = p1 + p 2 + L + p q , and Aii = Rii ∏ a kk
1 2 j =1 k = p1 + L+ pi −1 +1
(2π ) np / 2 ∏ np / 2 Aii n / 2
n i
i =1 p p
Thus, the likelihood criterion becomes R ∏ a jj R ∏ a jj
A j =1 j =1 R
1 n/2 ⇒ V= = = = .
A n/2
q q p1 +L+ pi −1 + pi q p q
λ= n np / 2
q
=
q
A
=V n/2 ∏
i =1
Aii ∏ Rii ∏ akk ∏
i =1
Rii ∏ a kk ∏
k =1 i =1
Rii
1 n/2 n/2 i =1 k = p1 + L+ pi −1 +1
∏ Aii ∏ Aii
n np / 2 i =1 i =1 n/2
R
Thus, λ = .
and H 0 : Σ ij = 0 is rejected if λ < λ0 , where λ0 is so chosen so as to have level α . q
n/2
Limiting results
∏ Rii
i =1
By the large sample general result about likelihood ratio criterion is Exercise: Let Ci be an arbitrary nonsingular matrix of order pi and let
− 2 ln λ ~ χ 2f approximately, where,
C1 L 0
*
p ( p + 1) 1 2 q 2
q C = M M M and x α = C xα + d .
p ( p + 1)
f = −∑ i i = p − ∑ pi .
0 L C
2 i =1
2 2 i =1
q
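The criterion λ = V^{n/2} with V = |A| / ∏_i |A_ii| (equivalently |R| / ∏_i |R_ii| ) and the large-sample result −2 ln λ ≈ χ²_f can be put together in a few lines. A sketch, assuming Python with numpy/scipy; the partition sizes and the simulated data are illustrative choices, and the simpler statistic −n ln V is used without the finite-sample correction factor m :

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 200
    sizes = [2, 2, 1]                        # three sets of variables (illustrative)
    p = sum(sizes)
    x = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)   # H0 true here

    A = (n - 1) * np.cov(x, rowvar=False)    # corrected sums of squares and products
    edges = np.cumsum([0] + sizes)
    logdet_blocks = sum(np.linalg.slogdet(A[a:b, a:b])[1]
                        for a, b in zip(edges[:-1], edges[1:]))
    V = np.exp(np.linalg.slogdet(A)[1] - logdet_blocks)           # |A| / prod |A_ii|

    f = (p ** 2 - sum(pi ** 2 for pi in sizes)) / 2
    stat = -n * np.log(V)                    # -2 ln(lambda)
    print(stat, "df =", f, "p-value =", stats.chi2.sf(stat, f))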
*
Show that the likelihood ratio criterion for independence interms of x α is identical to the and the likelihood ratio criterion is
criterion interms of x α . max L( µ , Σ 0 ) max L0
λ= =
max L( µ , Σ) max L
A
Solution: The likelihood ratio criterion for independence interms of x α is V = , the numerator is the maximum of L for µ , Σ ∈ ω restricted by H 0 and the denominator is
q
∏ Aii the maximum of L over the entire parametric space Ω . Now
i =1
define,
n q n
A
−1
1 n 1 1 i
exp − ∑ ( x α(i ) − x (i) )′ i (i ) (i )
A* = ∑ ( xα* − x * ) ( x*α − x * )′ , *
where, x α = ∑ (C x α + d ) = C x + d max LΩ = ∏
ni / 2
2 ni
( x α − x )
α =1 n α =1 i =1 ni p / 2 Ai α =1
(2π )
n ni
A* = ∑ C ( xα − x ) ( xα − x ) ′ C ′ = C A C ′ .
α =1
Similarly, q
1 1 1 1
= ∏ exp − ni p = exp − np .
n A
ni / 2 2 q
A
ni / 2 2
Aij* = ∑ ( xα*(i ) − x *(i) ) ( x*α( j ) − x *( j ) )′ i =1
(2π ) ni p / 2 i
(2π ) np / 2 ∏ i
α =1 ni n
i =1 i
n Similarly, under H 0 , the likelihood function is
= C i ∑ ( x α(i) − x (i) ) ( x α( j ) − x ( j ) ) ′ C j ′ = C i Aij C j ′ .
α =1 q 1 ni
1
Thus, max Lω = ∏ exp − ∑ ( x α(i ) − µˆ (i) )′ Σˆ −1 ( x α(i ) − µˆ (i ) )
ˆ ni / 2
i =1 ( 2π ) ni p / 2 Σ 2 α =1
A*
C AC′
V* = = 1 q ni −1
1 A
exp − ∑ ∑ ( x α(i) − x (i ) )′ ( x α(i ) − x (i ) )
q q
Aii* C i Aii C i ′ =
∏ ∏ A
n/2 2
i =1 α =1 n
i =1 i =1 (2π ) np / 2
n
C A C′ A q
= = = V , because ∏ Ci = C . 1 1
q q = exp − np .
Aii C i ′ 2
i =1 n/2
∏ Ci ∏ Aii
(2π ) np / 2
A
i =1 i =1 n
Therefore, the test is invariant with respect to linear transformation within each set. Thus, the likelihood criterion becomes
ni / 2 q
Testing equality of covariance matrices q
A 1 ni / 2 q
∏ n /2
∏ ni ni p / 2
Ai
np / 2 ∏ i
A i
Let x α(i ) α = 1, 2, L , ni ; i = 1, 2, L , q be an observation from N p ( µ (i ) , Σ i ) i i =1 ni n
λ = i =1 = = i =1 .
n/2 1 q n/2
A A
n/2 ni p / 2 A
H 0 : Σ1 = Σ 2 = L = Σ q = Σ (say), versus H A : Σ i ≠ Σ j , for i ≠ j .
n n np / 2 ∏ ni
i =1
Let
Bartlett has suggested that the likelihood ratio criterion be modified by replacing the sample
ni q
size ni by the degree of freedom υ i = ni − 1 , the modified criterion is
Ai = ∑ ( xα (i )
−x (i ) (i )
) ( xα − x (i)
)′ , and A = ∑ Ai .
α =1 i =1 q q
υ /2 υ /2
υp / 2 ∏ i ∏
The likelihood function is A i Si i
υ q
q
λ* = i =1 = i =1 , where, υ = ∑ υ i = n − q and
1 1 ni q
υi p / 2 A
υ/2 S
υ / 2
L ( µ , Σ) = ∏ exp − ∑ ( x α(i ) − µ (i ) ) ′ Σ i−1 ( x α(i) − µ (i) ) ∏υi
i =1
i =1 ( 2π ) i
n p/2 n / 2
Σi i 2 α =1 i =1
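Bartlett's modified criterion λ* can be computed directly from the within-sample matrices A_i and the degrees of freedom υ_i = n_i − 1 . The sketch below (Python with numpy/scipy assumed) refers −2 ln λ* to the usual large-sample χ² approximation with (q − 1)p(p + 1)/2 degrees of freedom; that degrees-of-freedom value is the standard one for this test and is supplied here as an assumption, since the corresponding statement is cut off in this copy of the notes:

    import numpy as np
    from scipy import stats

    def bartlett_modified_criterion(samples):
        """samples: list of (n_i x p) arrays. Returns -2 ln(lambda*) with sample
        sizes replaced by degrees of freedom nu_i = n_i - 1, and the chi-square df."""
        q = len(samples)
        p = samples[0].shape[1]
        A_i = [(x.shape[0] - 1) * np.cov(x, rowvar=False) for x in samples]
        nu_i = np.array([x.shape[0] - 1 for x in samples])
        nu = nu_i.sum()
        A = sum(A_i)
        log_lam = (sum(0.5 * v * np.linalg.slogdet(Ai / v)[1] for v, Ai in zip(nu_i, A_i))
                   - 0.5 * nu * np.linalg.slogdet(A / nu)[1])
        return -2.0 * log_lam, (q - 1) * p * (p + 1) / 2

    rng = np.random.default_rng(6)
    data = [rng.multivariate_normal(np.zeros(3), np.eye(3), size=m) for m in (50, 60, 40)]
    stat, df = bartlett_modified_criterion(data)
    print(stat, df, stats.chi2.sf(stat, df))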
max L( µ , Σ 0 ) max L0
B= ∑ ( y α − y ) ( yα − y )′ , when the null hypothesis is true, then, − 2 ln λ ~ χ 2p( p +1)+ p .
λ= = , α =1
max L( µ , Σ) max L
Proof: Given y ~ N p (υ , Φ ) , and H 0 : υ = υ 0 , Φ = Φ 0 .
α
the numerator is the maximum of L for µ , Σ ∈ ω restricted by H 0 and the denominator is
Let
the maximum of L over the entire parametric space Ω ,
X = C (Y − υ 0 ) , with C is a nonsingular matrix
where,
1 E X = 0 , and Σ X = C Φ 0−1 C ′ = I , under H 0 .
1
L ( µ , Σ) = exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ )
n/2
(2π ) np / 2
Σ 2 α ⇒ C ′C = Φ 0−1 ,
The likelihood function over the entire parametric space is then x1 , x 2 ,L , x n constitutes a sample from N p ( µ , Σ) and the hypothesis is
1 1 A
−1 H0 : µ = 0 , Σ = I
max LΩ = exp − ∑ ( x α − x )′ ( x α − x )
A
n/2 2 α n
(2π ) np / 2 n np / 2 1
n max LΩ = exp − np , and
(2π ) np / 2
A
n/2 2
n np / 2 1
= exp − np . 1
1
2 exp − ∑ xα ′ x α .
np / 2 n/2
(2π ) A max Lω =
(2π ) np / 2 2
α
The likelihood function over ω , the parametric space as restricted by H 0 : Σ = I is
Thus, the likelihood ratio criterion is
1 1
max Lω = exp − ∑ ( x α − x )′ I ( x α − x ) 1
np / 2 n/2
2 α exp − ∑ x α ′ x α
(2π ) I np / 2 1
max L( µ , Σ 0 ) 2 α =e n/2
exp − ∑ x α ′ x α .
λ= = A
np / 2 2
1 1 1 1
max L( µ , Σ) n 1 n α
= exp − tr I ∑ ( x α − x ) ( x α − x )′ = exp − tr A . exp − np
np / 2 2 np / 2 2
n/2 2
(2π ) α (2π ) A
∑ xα ′ xα = ∑ ( y α − υ 0 )′ C ′ C ( yα − υ 0 ) = ∑ ( yα − υ 0 )′ Φ 0−1 ( yα − υ 0 )
α α α
= tr ∑ ( y − υ 0 )′ Φ 0−1 ( y
α α
− υ 0 ) = tr Φ 0−1 ∑ ( y
α
−υ0)(y
α
− υ 0 )′
α α
= tr Φ 0−1 ∑ ( y − y ) ( y − y )′ + n ( y − υ 0 ) ( y − υ 0 )′
α α
α
= tr Φ −0 1 [ B + n ( y − υ 0 ) ( y − υ 0 ) ′] = tr (Φ −0 1 B) + tr Φ 0−1 n ( y − υ 0 ) ( y − υ 0 ) ′
= tr (Φ −0 1 B) + n ( y − υ 0 ) Φ 0−1 ( y − υ 0 )′
and
A = ∑ ( x α − x ) ( x α − x )′ = ∑ [{C ( y − υ 0 ) − C ( y − υ 0 )}{C ( y − υ 0 ) − C ( y − υ 0 )}′ ]
α α
α α
= ∑ [C ( y − y )] [C ( y − y )]′ = C ∑ ( y − y) ( y − y )′C ′ = C B C ′ .
α α α α
α α
Thus,
B
A = C B C′ = = B Φ 0−1 , because C Φ 0 C ′ = I , then C Φ 0 C ′ = 1 , and
Φ0
1
C C′ = Φ0 =1 ⇒ C C′ = .
Φ0
Therefore, the likelihood ratio criterion becomes
np / 2
e n/2 1
λ = B Φ 0−1 exp − [tr B Φ 0−1 + n ( y − υ 0 )′ Φ 0−1 ( y − υ 0 )] .
n 2
MULTIPLE AND PARTIAL CORRELATIONS Differentiating with respect to β and equating to zero
Result: − 2 E ( X (2) − µ ( 2) ) [ ( X 1 − µ1 ) − β ′ ( X ( 2) − µ ( 2) )] = 0
or E ( X 1 , X 2 ) = a µ 2 + b E ( X 22 ) V ( Xˆ 1 ) = E [ Xˆ 1 − E ( Xˆ 1 )][ Xˆ 1 − E ( Xˆ 1 )]′
If X 1 is the first component of X and X (2) the vector of remaining ( p − 1) components. σ σ 12′
where σ 12 and Σ 22 are defined as 11 .
We first express X 1 as a linear combination of X (2) defined by the relation σ
21 Σ 22
X * = µ + β ′ ( X ( 2) − µ (2) ) , the coefficient vector β is determined by minimizing
1 1 Note: Since the numerator is V ( Xˆ 1 ) , therefore, ρ1( 2, 3,L, p) ≥ 0 i.e. 0 ≤ ρ1(2, 3,L, p ) ≤ 1 .
U = E [ X1 − X 1* ] 2 = E [ X 1 − µ1 − β ′ ( X (2) − µ (2) )]2 . This is so because X̂ 1 is an estimate of X 1 .
A n −1
Given x α (α = 1,L , n) , n > p . We estimate Σ by Σˆ = = S, Σ11.2 = σ 11 − σ 12 ′ Σ −221 σ 21 = σ 11 , since σ 12′ = 0 , so that
n n
where, A = ∑ ( x α − x ) ( x α − x )′ . a11 − a12′ A22
−1
a 21 ~ W1 (n − 1 − ( p − 1), σ 11 )
α
a11 − a12′ A22
−1
a 21
Now A is partitioned as follows ⇒ ~ χ n2− p .
σ 11
a a12′
11
a ′ A −1 Consider
n , and the estimate of β is βˆ = σ 12′ Σ −1 = 12 22
A n ′
= = a12 ′ A22
−1
.
n a12 A22 22 n n
a11 a11 − a12′ A22
−1
a 21 a ′ A −1 a
n n = + 12 22 21
σ 11 σ 11 σ 11
Using the above estimate, the sample multiple correlation coefficient of X 1 on X 2 , L , X p is
or Q = Q1 + Q2 , (say),
σˆ 12′ Σˆ −221 σ 12 a ′ A −1 a
where Q ~ χ n2−1 , and Q1 ~ χ n2− p .
R1( 2,L, p ) = = 12 22 12
σˆ11 a11
and From Fisher Cochran theorem Q2 is independently distributed as χ n2−1−( n− p) , i.e.
The sample multiple correlation coefficient between X 1 and X (2) is defined by relation The distribution of the statistic F is
ν1 / 2 ν1
a ′ A −1 a a11 − a12′ A22
−1
a12 ν1
F2
−1
R 2 = 12 22 12 and 1 − R 2 = ,
a11 a11 df ( F ) = ν 2 dF ,
(ν1 +ν 2 ) / 2
ν ν ν1
a a12′ B 1 , 2 1 + F
where R 2 = R12( 2, 3,L, p ) and A = 11 . 2 2 ν 2
a A22
12
where ν 1 = p − 1 , ν 2 = n − p .
Therefore,
R2 ν 2 dR 2 ν 2
R2 a12′ A22
−1
a12 In this put F = , then dF =
= . 1 − R 2 ν1 (1 − R 2 ) 2 ν 1
1− R2 a11.2
ν1
We know that, if A is partitioned as ν1 / 2 −1
ν1 2
ν2 2
R
A11 A12 q ν 1 − R 2 ν1
A = and A ~ W p (n − 1, Σ) , then, A11 ~ Wq (n − 1, Σ11 ) and 2 ν 2 dR 2
A22 p − q df ( R 2 ) =
A21 (ν 1 +ν 2 ) / 2 ν 1 (1 − R 2 ) 2
ν ν R2
B 1 , 2 1+
1 − R 2
−1
A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11.2 ) . 2 2
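The sample multiple correlation R_{1(2,…,p)} and the statistic F = [R²/(1 − R²)] × (n − p)/(p − 1) ~ F_{p−1, n−p} (under ρ_{1(2,…,p)} = 0 ) are easy to compute from A . A minimal sketch, assuming Python with numpy/scipy; the population Σ , n and p are illustrative choices:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n, p = 60, 4
    Sigma = np.array([[1.0, 0.6, 0.4, 0.2],
                      [0.6, 1.0, 0.3, 0.1],
                      [0.4, 0.3, 1.0, 0.2],
                      [0.2, 0.1, 0.2, 1.0]])
    x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    d = x - x.mean(axis=0)
    A = d.T @ d                               # corrected sums of squares and products
    a11, a12, A22 = A[0, 0], A[0, 1:], A[1:, 1:]

    R2 = a12 @ np.linalg.solve(A22, a12) / a11    # sample multiple correlation squared
    F = (R2 / (1 - R2)) * (n - p) / (p - 1)
    print("R^2 =", R2, "F =", F, "p-value =", stats.f.sf(F, p - 1, n - p))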
ν1 ν 1 ν1 Similarly,
− +1−1 −1
ν1 2 2 R2 2
ν 1 − R 2 max Lω =
n np / 2 1
exp − Q , where
2 dR 2 np / 2
= (2π ) (a11 A22 ) n / 2 2
(ν 1 +ν 2 ) / 2
ν ν 1 (1 − R 2 ) 2
B 1 , 2 n a −1
2 2 1 − R 2 0 a11 0
Q = trace 11
−1 0
= tr nI = np , thus,
0
n A22 A22
ν1 ν1 +ν 2 ν1
1 −1 − +1−2
= (R 2 ) 2 (1 − R 2 ) 2 2 dR 2 n np / 2
ν ν 1
B 1 , 2 max Lω = exp − np .
2 2 (2π ) np / 2
(a11 A22 ) n / 2 2
ν1 ν2 Hence, the ratio
1 −1 −1
= (R 2 ) 2 (1 − R 2 ) 2 dR 2 . Put dR 2 = 2 R dR , thus the distribution of R . n/2
ν ν A A
B 1 , 2 λ= ⇒ λ2 / n = =1− R2
2 2 (a11 A22 ) n / 2 a11 A22
ν2 n− p
−1 −1 The likelihood ratio test is defined by the critical region λ ≤ λ0 , where
2 R (ν1−1) (1 − R 2 ) 2 2 R p −2 (1 − R 2 ) 2
df ( R) = dR = dR , 0 < R < 1.
ν ν p −1 n − p Pr[λ ≤ λ0 H 0 ] = α
B 1 , 2 B ,
2 2 2 2
⇒ λ2 / n ≤ λo2 / n ⇒ 1 − λ2 / n = R 2 > 1 − λo2 / n = R02 , (say)
Likelihood ratio criteria for testing H 0 : ρ1(2, 3,K, p ) = 0 . 2
R2 n− p R0 n − p
The likelihood function of the sample x α (α = 1, 2, K , n > p ) from N ( µ , Σ ) is ⇒ >
2 p −1
1− R 1 − R02 p − 1
1 1 ⇒ F > F p −1, n − p (α ) .
L( µ , Σ) = exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ )
np / 2 n/2
(2π ) Σ 2 α
Theorem: Multiple correlation is invariant under the non-singular linear transformation.
and the likelihood ratio criterion Proof: We know that
λ=
max L0
, where the numerator is the maximum of L for µ , Σ ∈ ω restricted by σ 12′ Σ 22
−1
σ 12 X1 σ σ 12′
max L ρ12(2,L, p ) = , where X = ( 2) , and Σ = 11
σ 11 X σ Σ 22
H 0 : ρ1(2, 3,K, p ) = 0 (i.e. β = 0 , ⇒ σ 12 = 0) and the denominator is the maximum of L 12
over the entire parametric space Ω . Let
Now Y1 = a11 X 1 , and Y (2) = A22 X ( 2)
1 1 A
−1 a 0
max LΩ = exp − ∑ ( x α − x ) ′ ( x α − x ) or Y = 11 X , where a11 ≠ 0 , A22 ≠ 0 .
A
n/2 2 α n 0 A22
(2π ) np / 2
n Assume that E X = 0 , and
np / 2 1 A −1
n
exp − tr ∑ ( x α − x ) ( x α − x ) ′ a 0 σ 11 σ 12′ a11 0 a 2 σ 11 a11 A22 σ 12 ′
= ΣY = 11 = 11
(2π ) np / 2 A
n / 2 2 n α 0 A22 σ 12 Σ 22 0 A22 a11 A22 σ 12 A22 Σ 22 A22
n np / 2 1 Therefore,
= exp − np ,
2 a11 A22 σ 12 ′ ( A22 Σ 22 A22 ) −1 a11 A22 σ 12
np / 2 n/2
(2π ) A
ρ12( 2, 3,K, p) (Y ) =
2
where trace nI p× p = np . a11 σ 11
σ ′ Σ −1 σ X1 σ σ 12′
= 12 22 12 = ρ12( 2, 3,K, p ) ( X ) . Exercise: Let X ~ N (0, Σ) , X = (2) , and Σ = 11 . The difference
σ X σ
11 12 Σ 22
X 1 − σ 12′ Σ −221 X ( 2) (called the residual of X 1 from its mean square regression line on
X σ σ 12′
Theorem: Let X ~ N (0, Σ) , X = (12) , and Σ = 11 , then of all linear
X (2) ) is uncorrelated with any of the independent variable X 2 , X 3 , K , X p .
X σ 12 Σ 22
combinations of X 2 , X 3 , K , X p has the maximum correlation with X 1 . Solution:
Proof: We know that, for any c nonzero scalar and γ Cov [( X 1 − σ 12′ Σ −221 X ( 2) ), X (2) ] = E[( X 1 − σ 12 ′ Σ −221 X (2) ) ( X ( 2) )′]
E ( β ′ X ( 2) ) 2 A11 A12 q
EX1(γ ′ X ( 2) ) A = , then, A11 ~ Wq (n − 1, Σ11 ) , and
−2
A21 A22 p − q
E (γ ′ X ( 2) ) 2
−1
A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11.2 ) .
E ( β ′ X ( 2) ) 2
or − 2 EX1( β ′ X ( 2) ) ≤ −2 EX1(γ ′ X ( 2) ) At p = 2 , and q = 1
E (γ ′ X ( 2) )2
a11
a11 ~ W1 (n − 1, σ 11 ) , ⇒ ~ χ n2−1 .
E ( β ′ X ( 2) ) 2 σ 11
EX 1 ( β ′ X ( 2) ) ≥ EX 1 (γ ′ X ( 2) )
E (γ ′ X ( 2) ) 2 In null case ρ = 0
−1
Dividing both the sides by EX 12 E ( β ′ X ( 2) ) 2 , we get σ 11.2 = σ 11 − σ 12σ 22 σ 21 = σ 11 , since σ 12 = 0 .
So that
EX 1 ( β ′ X (2) ) EX 1 (γ ′ X ( 2) )
≥ −1
a11 − a12 a 22 a 21 ~ W1 (n − 1 − (2 − 1), σ 11 )
EX 12 E ( β ′ X (2) ) 2 EX 12 E (γ ′ X ( 2) ) 2
−1
a11 − a12 a 22 a 21
Corr. ( X 1 , β ′ X (2) ) ≥ Corr. ( X 1 , γ ′ X ( 2) ) . ⇒ ~ χ n2− 2 .
σ 11
a11
−1
a11 − a12 a 22 a 21 −1
a12 a 22 a 21 V ( X 2.3K p ) = σ 22 − σ 23′ Σ 33
−1
σ 23 , and
= +
σ 11 σ 11 σ 11
Cov ( X 1.3K p , X 2.3K p ) = E[ X 1 − σ 13′ Σ 33 X ][ X 2 − σ 23′ Σ 33
−1 (3) −1 (3) ′
X ]
or Q = Q1 + Q2 , (say),
= σ 12 − σ 13′ Σ 33
−1
σ 23 .
where Q ~ χ n2−1 and Q1 ~ χ n2−2 , therefore, Q2 ~ χ12 , by the additive property of χ 2 . Thus,
Therefore,
−1
r2 n − 2 a12 a 22 a 21 / σ 11 n − 2 χ12
× = × = ~ F1, n − 2 . σ 12 − σ 13′ Σ 33
−1
σ 23
1− r2 1 a11.2 / σ 11 1 2
χn−2 / n − 2 ρ12 .3K p = .
(σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 )
Hence,
r Alternative proof
( n − 2) ~ t n−2 .
2
1− r Consider the var. cov. matrix of conditional distribution of X (1) given X (2) , which is
Partial correlation −1
Σ11.2 = Σ11 − Σ12 Σ 22 Σ 21 , in our case
The correlation between two variables X 1 and X 2 is measured by the correlation coefficient
σ 11 σ 12 σ 13′ −1
which sometimes called the total correlation coefficient between X 1 and X 2 . If X 1 and X 2 = − Σ (σ σ 23 )
′ 33 13
σ 21 σ 22 σ 23
are considered in the conjunction with p − 2 other variables X 3 , X 4 , K , X p , we may regard
the variation of X 1 and X 2 as to certain extents due to the variation of the other variables. σ 11 σ 12 σ 13′ Σ 33
−1
σ 13 σ 13′ Σ 33
−1
σ 23
= −
σ 21 σ 22 σ 23 Σ 33 σ 13 σ 23 Σ 33 σ 23
Let X 1.3K p and X 2.3K p represent these parts of variation of X 1 and X 2 respectively, ′ − 1 ′ − 1
which remains after subs traction of the best linear estimate in terms of X 3 , X 4 , K, X p .
σ − σ ′ Σ −1 σ ′ −1 σ σ 12 . 3 K p
= 11 13 33 13 σ 12 − σ 13 Σ 33 σ 23 = 11. 3K p
Thus we may regards the correlation coefficient between X 1.3K p and X 2.3K p as a measure σ − σ Σ σ
′ − 1 ′ − 1 σ σ 22 . 3K p
12 13 33 23 σ 22 − σ 23 Σ 33 σ 23 12 . 3K p
of correlation between X 1 and X 2 after removal of any part of the variation due to the
influence of X 3 , X 4 , K , X p , this correlation is called partial correlation of X 1 and X 2 with We know that simple correlation coefficient is
σ 12 σ σ 12
respect to X 3 , X 4 , K , X p . Let ρ= , where σ 11 , σ 12 , and σ 22 are defined as Σ = 11 .
σ 11 σ 22 21 σ 22
σ
σ ′
X1 11 σ 12 M σ 13 Therefore, the partial correlation coefficient is obtained like simple correlation coefficient
σ σ 22 M σ 23′ −1
X = X 2 , µ = 0 , and Σ = 12 from Σ11.2 = Σ11 − Σ 21Σ 22 Σ12
X (3) L L L L
σ σ 12. 3K p σ 12 − σ 13′ Σ 33
−1
σ 23
13 σ 23 M Σ 33 ρ12.3K p = = .
σ 11.3K p σ 22. 3K p (σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 )
and the best linear estimates of X 1 and X 2 interms of X (3)
are Xˆ 1 = σ 13′ Σ 33
−1 (3)
X , and
Xˆ = σ ′ Σ −1 X (3) respectively.
2 23 33
In general,
Let the partition of X and Σ as follows
Define,
X1
X 1.3K p = X 1 − X̂ 1 , and X 2.3K p = X 2 − X̂ 2 , then
M
V ( X 1.3K p ) = E [ X 1 − σ 13′ Σ 33 X ][ X 1 − σ 13′ Σ 33
−1 (3 ) −1 (3) ′ X
q X
X ] (1)
Σ11 Σ12
X = = , and Σ = .
X q +1 ( 2)
−1 (3) (3) ′ −1 X Σ 21 Σ 22
= E[ X 12 − 2 σ 13′ Σ 33
−1
X 1 X (3) + σ 13′ Σ 33 X X Σ 33 σ 13 ]
M
= σ 11 − 2 σ 13′ Σ 33
−1
σ 13 + σ 13′ Σ 33
−1 −1
Σ 33 Σ 33 σ 13 = σ 11 − σ 13′ Σ 33
−1
σ 13 . Xp
The partial correlation coefficient between the variable X i and X j ( X i , X j ∈ X (1) ) holding The distribution of the sample partial correlation rij . q +1K p based on a sample of size n from
a distribution with population correlation ρ ij . q+1K p is same as the distribution of ordinary
the components of X (2) fixed (there will be total of q C2 partial correlation coefficients), is
often denoted by correlation coefficient rij based on a sample of size n − ( p − q ) from a distribution with the
σ ij . q +1K p corresponding population partial correlation ρ ij . q +1K p = ρ . Thus,
ρ ij . q +1K p = , i, j = 1, 2,K , q .
(σ ii . q +1K p ) (σ jj . q +1K p ) r
n − ( p − q) − 2 ~ t n −( p − q ) − 2 .
1− r2
Estimation of partial correlation coefficient
We know that the population partial correlation coefficient between X i and X j holding the Test of hypothesis and confidence region for partial correlation coefficient
Case I: H 0 : ρ ij.q +1L p = ρ 0 , (a specified value), against H A : ρ ij.q +1L p ≠ ρ 0 . For
components of X (2) fixed, is given by
testing H 0 on the basis of sample of size n − ( p − q ) , we use the following test statistic
σ ij . q +1K p
ρ ij . q +1K p = . Z − ξ0
(σ ii . q +1K p ) (σ jj . q +1K p ) U= ~ N (0, 1) , where
1/ n − 3
Given a sample xα (α = 1, 2, K , n > p ) from N ( µ , Σ) , the maximum likelihood estimate of 1 1 + rij.q +1L p 1 1 + ρ0
Z = ln = tan h −1 rij.q+1L p , and ξ 0 = ln = tan h −1 ρ 0 .
ρ ij . q+1K p is 2 1 − rij.q +1L p 2 1 − ρ0
σˆ ij . q +1K p If the absolute value of U is greater than 1.96 , we reject H 0 at α = 0.05 level of
ρˆ ij . q +1K p = ,
(σˆ ii . q +1K p ) (σˆ jj . q +1K p ) significance otherwise accept H 0 . For confidence region when ρ ij.q +1L p is unknown,
then we can write
aij . q +1K p
i.e. rij . q +1K p = is called the sample partial correlation coefficient Pr [−U α / 2 ≤ (n − 3) (Z − ξ ) ≤ U α / 2 ] = 1 − α , where
(a ii . q +1K p ) (a jj . q +1K p )
1 1 + ρ ij.q +1L p
between X i and X j holding X q +1 , L , X p fixed, ξ = ln = tan h −1 ρ ij.q +1L p , then
2 1 − ρ ij.q +1L p
−1
where, (aij . q +1K p ) = A11 − A12 A22 A21 = A11.2 , i.e. aij . q +1K p is the i − th and j − th
Uα / 2 Uα / 2
A A12 Pr Z − ≤ tan h −1 ρ ij.q +1L p ≤ Z + =1− α
element of A11.2 , and A = 11 . (n − 3) (n − 3)
A21 A22
U α / 2 U α / 2
Distribution of sample partial correlation coefficient or Pr tan h Z − ≤ ρ ij.q +1L p ≤ tan h Z + =1− α .
(n − 3) (n − 3)
The sample partial correlation coefficient between X i and X j holding X q +1 , L , X p fixed
Case II: H 0 : ρ ij.q +1L p = 0 , against H A : ρ ij.q +1L p ≠ 0 . For testing H 0 on the basis of
is defined as
sample of size n − ( p − q ) , we use the following test statistic
aij . q +1K p
rij . q +1K p = , rij.q +1L p
(a ii . q +1K p ) (a jj . q +1K p ) t= n − ( p − q ) − 2 ~ t n −( p − q ) − 2 .
1 − rij2.q +1L p
−1
where, (aij . q +1K p ) = A11 − A12 A22 A21 = A11.2 , i.e. aij . q +1K p is the i − th and j − th
If t > t n−( p −q )−2 (α ) , we reject H 0 at α level of significance otherwise accept H 0 .
A A12
element of A11.2 , and A = 11 .
A21 A22 Theorem: Partial correlation coefficient is invariant under the nonsingular linear
transformation.
Let
Proof: We know that
A = ∑ ( x α − x ) ( x α − x ) ′ ~ W p (n − 1, Σ) , then,
α σ 12 − σ 13′ Σ 33
−1
σ 23
ρ12 .3K p ( X ) = ,
−1
A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11.2 ) . (σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 )
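Both tests above are straightforward once A_{11.2} = A_11 − A_12 A_22^{−1} A_21 is available. A sketch in Python (numpy/scipy assumed); here p = 3 and q = 2 , so there is a single partial correlation r_{12.3} , and the effective sample size n − (p − q) is used in both the t and the Fisher z statistics, as described above ( ρ_0 and the simulated data are illustrative choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    n, q = 100, 2                    # first q variables; the remaining ones are held fixed
    Sigma = np.array([[1.0, 0.5, 0.4],
                      [0.5, 1.0, 0.3],
                      [0.4, 0.3, 1.0]])
    x = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
    p = x.shape[1]

    d = x - x.mean(axis=0)
    A = d.T @ d
    A11, A12, A22 = A[:q, :q], A[:q, q:], A[q:, q:]
    A11_2 = A11 - A12 @ np.linalg.solve(A22, A12.T)     # A_{11.2}
    r12_3 = A11_2[0, 1] / np.sqrt(A11_2[0, 0] * A11_2[1, 1])

    m = n - (p - q)                                     # effective sample size
    t = r12_3 * np.sqrt(m - 2) / np.sqrt(1 - r12_3 ** 2)
    print("r12.3 =", r12_3, "t =", t, "p-value =", 2 * stats.t.sf(abs(t), m - 2))

    rho0 = 0.3                                          # hypothesised value (illustrative)
    z, xi0 = np.arctanh(r12_3), np.arctanh(rho0)
    U = (z - xi0) * np.sqrt(m - 3)
    print("U =", U, "p-value =", 2 * stats.norm.sf(abs(U)))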
σ ′ Solution:
11 σ 12 σ 13
where Σ X = σ 12 σ 22 σ 23′ .
i) We are given that
σ 13 σ 23 Σ 33 ρ ij = ρ , ij = 1, 2, L , p ; i ≠ j , we have
−1
Let σ ij − σ ik σ kk σ jk
ρ ij . k =
(3) (3) −1 −1
Y1 = a11 X 1 , a11 ≠ 0 , Y2 = a 22 X 2 , a 22 ≠ 0 , and Y = A33 X , A33 ≠ 0 σ ii − σ ik σ kk σ ik σ jj − σ jk σ kk σ jk
a11 0 0 1
σ iσ j ρ ij − σ iσ k ρ ik σ jσ k ρ jk
or Y = 0 a 22 0 X . σ kσ k
=
0 0 A33 1 1
σ ii − σ iσ k ρ ik σ iσ k ρ jk σ jj − σ j σ k ρ jk σ jσ k ρ jk
σ kσ k σ kσ k
Assume that E X = 0 , and
σ iσ j ( ρ ij − ρ ik ρ jk ) ρ − ρ2 ρ
′
a11 0 0 σ 11 σ 12 σ 13 a11 0 0 = = =
1+ ρ
.
σ iσ j 1 − ρ ik2 1 − ρ 2jk 1− ρ 2
1− ρ 2
ΣY = 0 a 22 0 σ 12 σ 22 σ 23′ 0 a 22 0
0
0 A33 σ 13 σ 23 Σ 33 0 0 A33 ρ
Thus every partial correlation coefficient of order 1 is . Similarly,
1+ ρ
a2 σ a11a 22σ 12 a11σ 13′ A33
11 11 2
ρ ρ ρ ρ
= a11a 22σ 12 2
σ 22 a 22 σ 23′ A33 − 1 −
1 + ρ 1 + ρ
a 22
ρ ij.l − ρ ik.l ρ jk.l 1+ ρ 1+ ρ ρ
a11σ 13 A33 A33 σ 23 a 22 A33 Σ 33 A33 ρ ij.kl = = = =
1− ρ ik2 .l 1− ρ 2jk.l ρ
2 ρ ρ 1 + 2ρ
1 − 1 + 1 −
by the analogy ρ12 .3K p ( X ) 1 + ρ 1+ ρ 1+ ρ
ρ
a11a 22σ 12 − a11σ 13′ A33 ( A33Σ 33 A33 ) −1 A33 σ 23 a 22 Thus every partial correlation coefficient of order 1 is .
ρ12.3K p (Y ) = 1 + 2ρ
a 2 σ − a σ ′ A ( A Σ A ) −1 A σ a
11 11 11 13 33 33 33 33 33 13 11 The partial correlation coefficient of the highest order in p − variate distribution is p − 2 ,
1 by the method of induction, the every partial correlation coefficient of order p − 2 is
× . ρ
2
a 22 σ 22 − a 22 σ 23′ A33 ( A33Σ 33 A33 ) −1 A33 σ 23a 22 1 + ( p − 2) ρ
. Since ρ ij.( p−1) components ≤ 1 , so that
σ 12 − σ 13′ Σ 33−1
σ 23 We have on considering the lower limit
ρ12.3K p (Y ) = = ρ12. 3K p ( X ) .
(σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 ) −1≤
ρ
, or − (1 + pρ − 2 ρ ) ≤ ρ or − 1 ≤ ρ + pρ − 2
1 + ( p − 2) ρ
Note:
1
or − 1 ≤ ρ ( p − 1) or ρ≥− .
1 − ρ12( 2, 3, L, p) = 2
(1 − ρ12 2
) (1 − ρ13 2 2
.2 ) (1 − ρ14.23 ) L (1 − ρ1 p.23L p −1 ) . p −1
ii) We know that
Exercise: If all the total correlation coefficient in a p − variate normal distribution are
equal to ρ ≠ 0 , show that 1 − ρ12(2, 3,L, p) = (1 − ρ12
2 2
) (1 − ρ13 2 2
.2 ) (1 − ρ14.23 ) L (1 − ρ1 p.23L p −1 ) ,
1 ( p − 1) ρ 2 ρ
i) ρ≥− , and ii) ρ12(2, 3,L, p ) = . since ρ12 = ρ , and ρ13.2 = .
p −1 1 + ( p − 2) ρ 1+ ρ
ρ
2 (1 + ρ ) 2 − ρ 2 (1 − ρ ) (1 + 2 ρ )
ρ12(2, 3) = (1 − ρ 2 ) 1 − = (1 − ρ 2 ) = .
1 + ρ (1 + ρ ) 2 (1 + ρ )
Similarly
(1 − ρ ) (1 + 2 ρ ) ρ
2
1 − ρ12( 2, 3, 4) = (1 − ρ12
2 2
) (1 − ρ13 2
.2 ) (1 − ρ14.23 ) =
1 −
(1 + ρ ) 1 + 2 ρ
(1 − ρ ) (1 + 2 ρ ) 1 + 3ρ 2 + 4 ρ (1 − ρ ) (1 + 3ρ )
= = .
(1 + ρ ) (1 + 2 ρ ) 2 (1 + 2 ρ )
By the method of induction
1 + ( p − 1) ρ
1 − ρ12( 2, 3,L, p ) = (1 − ρ )
1 + ( p − 2) ρ
1 + pρ − 2 ρ − [1 + pρ − 2 ρ − ( p − 1) ρ 2 ] ( p − 1) ρ 2
ρ12(2, 3, L, p) = = .
1 + ( p − 2) ρ 1 + ( p − 2) ρ
HOTELLING'S − T 2 and
ΣY = E [Y − E (Y )][Y − E (Y )]′ = Σ .
If X is univariate normal with mean µ and standard deviation σ , then
Therefore,
n (x − µ) 1 (n − 1) s 2
U=
σ
~ N (0,1) , and V =
2 ∑ ( xi − x ) 2 = 2
~ χ n2−1 , where s 2 is the
Y ~ N p (υ , Σ) , then
σ i σ
sample variance from a sample of size n . If U and V are independently distributed, then
Student's − t is defined as T 2 = Y ′ S −1Y .
n (x − µ) / σ n (x − µ)
Since Σ is positive definite matrix, there exits a nonsingular matrix C , such that
U
t= = = ~ t n−1 .
V / n −1 (n − 1) s 2 /(n − 1) σ 2 s C Σ C′ = I ⇒ C ′C = Σ −1 .
Define,
The multivariate analogue of Student's − t is Hotelling's T 2 .
If x α (α = 1, 2, L , n) is an independent sample of size n from N p ( µ , Σ ) and, if x is the Y * = CY , S * = C S C ′ , and υ * = Cυ , then
sample mean vector, S the matrix of variance covariance, then the Hotelling's − T 2 is E (Y * ) = C n E ( x − µ ) = C n ( µ − µ ) = Cυ = υ *
0 0
defined by the relation
and
T 2 = n ( x − µ ) ′ S −1 ( x − µ ) .
Σ = E [Y * − E (Y * )][Y * − E (Y * )]′ = C Σ C ′ = I .
Y*
Result:
Therefore,
Let the square matrix A be partitioned as
Y * ~ N p (υ * , I ) , then
A11 A12
A = , so that A22 is square. If A22 is nonsingular, let
A21 A22 ′ −1
T 2 = Y * S* Y *
I − A12 A −1 −1
0 A11.2 0
C = 22 , then CAC ′ = A11 − A12 A22 A21 =
and
0
I
0 A22 0 A22 n−1 ′
(n − 1) S * = ∑ Z α* Z *α , where Z α
*
= C Z α ~ N (0, I ) .
A 0 −1 α =1
⇒ A = C −1 11.2 C ′ , then
0 A22 Now consider a random orthogonal matrix Q of order p × p
−1 −1 −1 −1
A 0 A11 − A11 .2 A12 A22
Y* Y2* Y p*
A −1 = C ′ 11.2 C = .2 . 1
0 A22 − A A A −1
−1 −1 −1 −1
A22 A21 A11.2 A12 A22 + −1
A22 ′ ′
L
22 21 11.2 Y Y*
* *′ *
Y* Y* Y Y
Q= q q 22 L q2 p .
Distribution of Hotelling's − T 2 21
M M L M
Let x1 , L , x n be an independent sample drawn from N p ( µ , Σ) , then
q p1 q p2 L q pp
n −1
A = (n − 1) S = ∑ Z α Z α ′ with Z α independent, each with distribution N p (0, Σ) . Let
α =1
U = Q Y * , be an orthogonal transformation, also B = (bij ) = Q (n − 1) S *Q ′ .
By definition,
So that
T 2 = n ( x − µ ) ′ S −1 ( x − µ ) .
2 2 2 ′
Let Y1* + Y2* + K + Y p* Y* Y* ′
U1 = = = Y* Y*
Y = n ( x − µ ) , then E (Y ) = n E ( x − µ ) = n ( µ − µ ) = υ (say) *′ * *′ *
0 0 0 Y Y Y Y
and ⇒ b11.2 is a χ 2 with (n − p) degree of freedom, but the conditional distribution of b11.2
p ′ p
Yi* does not depend upon Q . Therefore b11.2 is unconditionally distributed as a χ 2 with
U j = ∑ q j i Yi* = Y * Y * ∑ q ji , j = 2, 3,K , p
′ (n − p) degree of freedom.
i =1 i =1 Y* Y*
=0 ∀ j = 2, 3,K , p , by using the property of an orthogonal matrix. Since Y * is distributed according to N p (υ * , I )
Thus, ′ ′ ′
⇒ Y * Y * ~ χ 2p (υ * υ * ) , where υ * υ * = υ ′ Σ −1 υ = λ 2 (say).
′ −1
T 2 = Y * S * Y * = (Q −1U )′ S *−1 (Q −1U ) = U ′ (Q S *Q ′) −1U
T2
Thus the is the ratio of a noncentral χ 2 with p degree of freedom to an independent
= (n − 1)U ′ [Q (n − 1) S *Q ′]−1U = (n − 1) U ′ B −1U . n −1
χ 2 with (n − p) degree of freedom, i.e.
b11 b12 L b1 p U1
T2 b 21 b 22 L b 2 p 0 T2 n− p χ 2p (λ2 ) / p
⇒ = U ′ B −1U = (U 1 0 L 0) = ~ F p , n − p (λ 2 ) .
n −1 M M M M M n −1 p χ n2− p /(n − p )
b p1 b p 2 L b pp 0
If µ = µ , then the F − distribution is central.
0
= U12 b11 , where (b ij ) = B −1 .
Alternative proof
We know that if the square matrix say B is partition as
By definition,
b b12′
B = 11 , and if B22 is nonsingular, then T2
b B22 T 2 = n ( x − µ )′ S −1 ( x − µ ) or = n ( x − µ 0 )′ A −1 ( x − µ 0 ) .
12
n −1
−1
b11 −1
− b11 −1
.2 b12 B22
B −1 = .2 . Let
− B −1 b b −1 −1 −1 ′ −1
B22 b 21 b11.2 b12 B22 + −1
B22
22 21 11.2 d = n ( x − µ ) , ⇒ d ~ N p (0 , Σ)
0
Thus,
′ ⇒ d d ′ = Q ~ W p (1, Σ) (6.1)
Y* Y*
U 12 b11 = , and A ~ W p (υ , Σ) (6.2)
b11.2
1 1 Now
where b11 = = .
b11.2 b − b ′ B −1 b
11 12 22 12 1 d′
= A 1 + d ′ A −1 d or A + d d′
Therefore, −d A
′
T2 Y* Y* A 1 1 1
= . ⇒ = = = (6.3)
A + d d′ ′ −1
n − 1 b11.2 −1
1+ d A d ′
1 + n (x − µ 0 ) A (x − µ 0 ) T2
1+
But n −1
n −1 n−1 1 A
B = Q (n − 1) S *Q ′ = ∑ (Q Z *α ) (Q Z α* ) ′ = ∑ V α V α ′ , We determine the distribution Φ =
T 2
=
A+Q
.
α =1 α =1 1+
n −1
*
where, V α = Q Z α ~ N p (0, I ) , for given Q .
A and Q are independent as Q is based on x . Thus from (6.1), (6.2) and the additive
n−1−( p −1) property of wishart distribution, we have
⇒ b11.2 is distributed as ∑ wα2 , where wα are independent N (0, 1) .
A + Q ~ W p ( n, Σ )
α =1
1 −1 1 1 r l −1 β (r + l , m) Γ (r + l ) Γ (m) Γ(l + m)
(1 − x) m −1 dx =
β (l , m) ∫0
(υ + 2 r − p −1) / 2 − 2 tr A Σ E (x r ) = x x =
A e f (Q ) dA dQ (6.4) β (l , m) Γ (r + l + m) Γ(l ) Γ(m)
Γ(r + l ) Γ (l + m)
=
Note that expression inside the bracket is W p (υ + 2r , Σ) . Γ(r + l + m) Γ(l )
Let A + Q = U , we will now integrate over the constant surface of A + Q = U , from the So if we write
additive property of wishart distribution υ − p +1 p υ +1
l= , and m = , ⇒ l+m=
1 −1
2 2 2
r C ( p, υ ) C ( p,υ + 2r + 1) (υ + 2r +1− p −1) / 2 − 2 tr U Σ
E (Φ ) =
C ( p, υ + 2r ) Σ
−r u ∫ U
r
Σ
(υ + 2r +1) / 2
U e du υ − p + 1 p
⇒ Φ ~ β1 , .
2 2
1
C ( p,υ ) C ( p,υ + 2r + 1) C ( p,υ + 1) (υ +1− p −1) / 2 − 2 tr U Σ
−1 υ * υ* 1 − X υ1*
= ∫ U e du We know that if X ~ β1 1 , 2 , then × ~ Fυ * , υ * .
C ( p,υ + 2r ) C ( p,υ + 1) U r (υ +1) / 2 2 2 X υ 2* 2 1
Σ
p So that
υ + 2r − i + 1 (υ +1) p / 2
2 (υ + 2r ) p / 2 (π ) p ( p −1) / 4 ∏ Γ 2 1− Φ υ − p +1
i =1
2 × ~ F p,υ − p +1
= Φ p
p
υ − i + 1 (υ + 2 r +1) p / 2
2υp / 2 (π ) p ( p −1) / 4 ∏ Γ 2 1
i =1
2 1−
1 + T 2 / (n − 1) n − p
p ⇒ × ~ F p, n− p
υ + 1− i + 1 1 p
(π ) p ( p −1) / 4 ∏ Γ
i =1
2 1 + T 2 / (n − 1)
×
p
υ + 2r + 1 − i + 1 T2 n− p
(π ) p ( p −1) / 4 ∏ Γ ⇒ × ~ F p, n− p .
i =1
2 n −1 p
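The one-sample Hotelling T² and its exact F form T² (n − p)/((n − 1)p) ~ F_{p, n−p} under H_0 : μ = μ_0 take only a few lines to compute. A minimal sketch (Python with numpy/scipy assumed; the data are simulated with a small mean shift purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    n, p = 30, 3
    mu0 = np.zeros(p)
    x = rng.multivariate_normal([0.2, 0.0, -0.1], np.eye(p), size=n)

    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)                 # unbiased sample covariance
    T2 = n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)

    F = T2 * (n - p) / ((n - 1) * p)            # T^2 (n-p)/((n-1)p) ~ F_{p, n-p} under H0
    print("T^2 =", T2, "F =", F, "p-value =", stats.f.sf(F, p, n - p))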
n np / 2 1
T 2 − Statistic as a function of likelihood ratio criterion max Lω = exp − np
(2π ) np / 2
A + n ( x − µ ) ( x − µ )′
n/2 2
Let x α (α = 1, 2, L , n > p ) be a random sample of size n from N p ( µ , Σ) . The likelihood 0 0
n np / 2 1 where, T 2 = n ( x − µ ) ′ S −1 ( x − µ ) = n (n − 1) ( x − µ ) ′ A −1 ( x − µ ) .
= exp − np . 0 0 0 0
(2π ) np / 2 A
n/2 2
The likelihood ratio test is defined by the critical region λ ≤ λ0 , where, λ0 is so chosen so as
Similarly, to have level α , i.e. Pr [λ ≤ λ 0 H 0 ] = α .
1 Thus
max Lω =
n/2
1 T2
(2π ) np / 2 ∑ ( x − µ 0 ) ( x α − µ 0 )′
n α α λ2 / n ≤ λ02 / n , or
1
≤ λ02 / n , or 1 + ≥ λ−0 n / n ,
2 n −1
1 + T /(n − 1)
−1
1 or T 2 ≥ (n − 1) (λ−0 2 / n − 1) = T02 , (say)
exp − tr n ∑ ( xα − µ 0 ) ( x α − µ 0 )′ ∑ ( x α − µ 0 ) ( x α − µ 0 ) ′
2 α α
⇒ T 2 ≥ T02 .
n np / 2 1 Therefore, Pr [T 2 ≥ T02 H 0 ] = α .
= exp − np .
n/2 2
(2π ) np / 2 ∑ ( xα − µ ) ( x α − µ )′
0 0 Invariance property of T 2
α
T2 n− p Let
T 2 = Y ′ S −1Y , and ~ F p, n − p .
n −1 p n1n2
Y= ( x (1) − x (2) ) , then,
Thus adopting a significance level of size α , the null hypothesis is rejected if n1 + n 2
(n − 1) p n1n2
T 2 ≥ T02 , where, T02 = F p, n− p (α ) . EY = 0 , under H 0 , and ΣY = E ( x (1) − x (2) ) ( x (1) − x ( 2) )′ = Σ .
n− p n1 + n 2
n1 + n 2 − 2
S (i ) = ∑ ( x (i ) − x (i ) ) ( x α(i ) − x (i) )′
ni − 1 α =1 α
= A(1) + A (2) = ∑ Z α Z α ′ , with Z α ~ N p (0, Σ) .
α =1 and
n1 + n2 − 2 q ni
1
Hence, (n1 + n2 − 2) S is distributed as ∑ Zα Zα′ . S=
q ∑ ∑ ( xα(i) − x (i ) ) ( xα(i ) − x (i ) )′
α =1
∑ ni − q i =1 α =1
Therefore, by definition i =1
nn q q ni
T 2 = Y ′ S −1 Y = 1 2 ( x (1) − x (2) ) ′S −1 ( x (1) − x ( 2) ) n − qS =
n1 + n 2 ∑ i ∑ ∑ ( xα(i) − x (i ) ) ( xα(i) − x (i ) ) ′
i =1 i =1 α =1
and
∑ ni −q
T2 n1 + n2 − 2 − ( p − 1) i
~ F p, n1 + n 2 − p −1 . = ∑ Z α Z α ′ , where, Z α ~ N p (0, Σ)
n1 + n2 − 2 p α =1
Thus adopting a significance level of size α , the null hypothesis is rejected if T 2 ≥ T02 , Therefore, by definition
(n1 + n 2 − 2) p ′
where, T02 = F p , n1 + n2 − p −1 (α ) .
n1 + n2 − p − 1 T 2 = Y ′ S −1 Y = C ∑ β i x (i ) − µ S −1 ∑ β i x (i ) − µ is distributed as T 2 with
i i
q − sample problem ∑ in − q degree of freedom, i.e.
i
Let x α(i ) (α = 1, 2, L , ni ; i = 1, 2, L , q) be a random sample from N p ( µ (i ) , Σ) respectively.
Suppose we are required to test ∑ ni − q − ( p − 1)
T2 i
~ F p, ∑ ni −q − p +1 .
q
H 0 : ∑ βi µ (i )
= µ , where β1 , L , β q are given scalars and µ is given vector.
∑ ni − q p i
i
i =1
Let Thus adopting a significance level of size α , the null hypothesis is rejected if T 2 ≥ T02 ,
n where
1 i (i )
x (i ) = ∑ x be the sample mean vector, and x (i ) ~ N p ( µ (i ) , Σ / ni ) , then,
ni α =1 α
∑ ni − q p
T02 =
q
1 i
∑ β i x (i) ~ N p µ , Σ , under H 0 , where E ∑ β i x (i ) = µ F p, n −q − p+1 (α ) .
i =1 C
i
∑ ni − q − p + 1 ∑i i
i
Given a random sample x1 , L , x n from N p ( µ , Σ) , where µ ′ = ( µ1 , µ 2 ,L , µ p ) . The Let x α(i ) (α = 1, 2, L , ni ; i = 1, 2) be random sample from N p ( µ (i ) , Σ i ) . Hypothesis of
hypothesis of interest is H 0 : µ1 = µ 2 = L = µ p . interest is H 0 : µ (1) = µ ( 2) . The mean x (1) of the first sample is normally distributed with
expected value
Let C be a matrix of order ( p − 1) × p such that C η = 0 , where η ′ = (1,1, L ,1) .
E ( x (1) ) = µ (1) , and covariance matrix
A matrix satisfying this condition is
1 0 L 0 − 1 1
E ( x (1) − µ (1) ) ( x (1) − µ (1) )′ = Σ1 , i.e. x (1) ~ N p ( µ (1) , Σ1 / n1 ) .
1 L 0 − 1 n1
0
C = ,
M M M M M Similarly,
0 0 L 1 − 1 ( p −1)× p
x ( 2) ~ N p ( µ ( 2 ) , Σ 2 / n 2 ) .
where, C is called a contrast matrix. Thus,
Let
1 1
y = C x α , then, ( x (1) − x (2) ) ~ N p µ (1) − µ (2) , Σ1 + Σ 2 .
α n
1 n 2
Ey = C µ , and Σ y = E [ y − E ( y )][ y − E ( y )]′ = CΣC ′ , If n1 = n 2 = n
α α α α α α
with this transformation we can write Let y = x α(1) − x α(2) , (assuming the numbering of the observations in the two samples is
α
1 0 L 0 − 1 µ1 µ1 − µ p = 0 independent of the observations themselves), then y ~ N p (0, Σ1 + Σ 2 ) under H 0 .
α
0 1 L 0 − 1 µ2 µ2 − µ p = 0
H 0 : C µ = 0 , i.e. M =0, or 1 n Σ +Σ
M M M M M
M ⇒ y= ∑ y = ( x (1) − x (2) ) ~ N p 0, 1 n 2 ⇒ n y ~ N p (0, Σ1 + Σ 2 ) .
n α =1 α
0 0 L 1 − 1 ( p−1)× p µ p µ p −1 − µ p = 0
Let
Therefore,
1 n
y ~ N p −1 (0 , C Σ C ′) under H 0 , then
α Sy = ∑ ( y − y ) ( y α − y )′
n − 1 α =1 α
C ΣC′
y ~ N p −1 0, ⇒ n y ~ N p −1 (0, C Σ C ′) . n−1
n or (n − 1) S y = ∑ Z α Z α ′ , where, Z α ~ N p−1 (0, Σ1 + Σ 2 ) .
Let α =1
1 n
1 n Thus, by definition, T 2 = n y ′ S y−1 y has T 2 − distribution with (n − 1) degrees of freedom.
Sy = ∑ ( y − y ) ( y α − y )′ = n − 1 ∑ C ( xα − x ) ( xα − x )′C ′ = CSC ′
n − 1 α =1 α α =1 (n − 1) p
The critical region is T 2 ≥ F p, n− p (α ) .
n−1 ′ n− p
(n − 1) S y = ∑ Z α* Z α* *
, with Z α ~ N p −1 (0, C ΣC ′) .
If n1 ≠ n 2 , and n1 < n 2 .
α =1
Thus, by definition, Define,
n1 n2
T 2 = n y ′ S y−1 y is distributed as T 2 with (n − 1) degree of freedom, and the critical region of n1 (2) 1 1
y
α
= x α(1) − x +
n2 α
∑ x (β2) − n ∑ x γ(2) , α = 1, 2, L, n1 , then
n1n 2 β =1 2 γ =1
size α for testing H 0 : C µ = 0 is
n1 n
n1 ( 2) 1 1 2 (2)
T2 ≥
(n − 1) ( p − 1)
F p −1, n − p +1 (α ) . Ey
α
= µ (1) −
n2
µ + ∑
n1n 2 β =1
µ ( 2) − ∑
n 2 γ =1
µ
n − 1 − ( p − 2)
n So that
1 1 1 n
y= ∑
n1 α =1
y ~ N p 0, Σ1 + 1 Σ 2
α q
n β2
n1 n2 y ~ N p µ , ∑ 1 i Σ i under H 0
α n
n1 i =1 i
⇒ n1 y ~ N p 0, Σ1 + Σ 2 .
n2 and
Consider 1 1
n 1 q n1β i2 q n β2
∑ y ~ N p µ, n1 ( y − µ ) ~ N p 0, ∑ 1 i Σ i .
n1 ∑ ni i
y= Σ ⇒
n1 n1 −1 n1 α =1 α n
n i =1 i =1 i
(n1 − 1) S = ∑ ( yα − y) ( y
α
− y )′ = ∑ Z α Z α ′ , with Z α ~ N p 0, Σ1 + 1 Σ 2 .
α =1 α =1 n2 Consider
Therefore, by definition, n1
(n1 − 1) S = ∑ ( y α − y ) ( y α − y )′ .
T 2 = n1 y ′ S −1 y . This statistic has T 2 − distribution with (n1 − 1) degree of freedom. α =1
The critical region of size α is Therefore, T 2 = n1 ( y − µ )′S −1 ( y − µ ) has T 2 − distribution with (n1 − 1) degree of
(n1 − 1) p (n1 − 1) p
T2 ≥ F p, n1 − p (α ) . freedom. The critical region is T 2 ≥ F p, n1 − p (α ) , with α level of significance.
n1 − p n1 − p
y = x (1) − x (2) , then the expectation of y is E ( y ) = E ( x (1) − x ( 2) ) = µ (1) − µ ( 2) the distribution of Hotelling’s T 2 .
and the variance covariance matrix of y is Solution: Consider the variance covariance of conditional distribution of X 1 given X (2) ,
which is σ − σ ′ Σ −1 σ , where
11 12 22 21
E [ y − E ( y )][ y − E ( y )]′ = E [( x (1) − x (2) ) − ( µ (1) − µ ( 2) )] [( x (1) − x (2) ) − (µ (1) − µ (2) )]′
σ σ 12 ′ 1
= E [( x (1) − µ (1) ) ( x (1) − µ (1) ) ′ − ( x (1) − µ (1) ) ( x ( 2) − µ ( 2) )′ Σ = 11 , and Σ = Σ 22 σ 11 − σ 12′ Σ −221 σ 21 .
σ
21 Σ 22 p − 1
− ( x ( 2) − µ ( 2) ) ( x (1) − µ (1) ) ′ + ( x ( 2) − µ (2) ) ( x (2) − µ (2) ) ′] Thus,
Σ 1
= Σ11 − Σ12 − Σ 21 + Σ 22 . = σ 11 − σ 12′ Σ 22
−1
σ 21 = , where σ 11 is the leading term of Σ −1 .
Thus,
Σ 22 σ 11
Let A = ∑ ( x α − x ) ( x α − x )′ and partitioned according as Σ
y ~ N ( µ (1) − µ (2) , Σ11 − Σ 21 − Σ12 + Σ 22 )
α
and a a12′ 1
A = 11 , and A = A22 a11 − a12 ′ A22
−1
a 21
Σ − Σ 21 − Σ12 + Σ 22 a A22 p − 1
y ~ N µ (1) − µ (2) , 11 21
n
A 1
⇒ n y ~ N (0, Σ11 − Σ12 − Σ 21 + Σ 22 ) , under H 0 . or = a11 − a12′ A22
−1
a 21 = (6.6)
A22 a11
Consider,
where a11 is the leading term of A −1 .
n
(n − 1) S = ∑ ( yα − y) ( y
α
− y)′ Since A ~ W p (n − 1, Σ) , and, is partitioned as
α =1
A11 A12 q −1
A = , then, A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11 − Σ12 Σ −221 Σ 21 )
Therefore, by definition, T 2 = n y ′ ( S11 − S12 − S 21 + S 22 ) −1 y , which has T 2 − distribution A21 A22 p − q
with (n − 1) degree of freedom. The critical region is
⇒ a11 − a12′ A22
−1
a 21 ~ W1 (n − 1 − ( p − 1), σ 11 − σ 12′ Σ 22
−1
σ 21 ) (6.7)
2 (n − 1) q
T ≥ Fq , n− q (α ) with α level of significance. In view of equation (6.6), equation (6.7) reduces as
n − 1 − (q − 1)
1 1 σ 11
Result: ~ W1 n − p, ⇒ ~ χ n2− p .
11
a σ 11 a11
Σ Σ12
If Σ = 11 , then Σ = Σ 22 Σ11 − Σ12 Σ −221 Σ 21 Of course we could obtain the var. cov. of conditional distribution of any component X j
Σ
21 Σ 22
jj
σ
Σ 1 ( j = 1, 2,L , p ) of X given the other components. Hence, ~ χ n2− p , where, σ jj and a jj
or = Σ11 − Σ12 Σ −221 Σ 21 = , where σ 11 is the leading term of Σ −1 . a jj
Σ 22 σ 11
are the leading term of Σ −1 and A −1 respectively.
Let C be a nonsingular matrix and let D 2 = ( x (1) − x ( 2) )′ S −1 ( x (1) − x ( 2) ) and is known as Mahalanobis's D 2 ,
−1′
B = C −1 , ⇒ B = C ′ , as A ~ W p (n − 1, Σ) , and BAB ′ ~ W p (n − 1, BΣB ′) . where
2 ni
(n1 − 1) S (1) + (n2 − 1) S (2) 1
Now S=
n1 + n 2 − 2
= ∑ ∑ ( x (i ) − x (i ) ) ( xα(i ) − x (i) )′ ,
n1 + n2 − 2 i =1 α =1 α
( BAB ′) −1 = B ′ −1 A −1 B −1 = C ′A −1C , and ( BΣB ′) −1 = B ′ −1Σ −1 B −1 = C ′Σ −1C
n
1 i (i )
Let C = C 1′ L C p ′ , where C i is a p − component vector, then the leading term in ( BAB ′) −1 where x (i ) = ∑ x , i = 1, 2 .
ni α =1 α
C ′ Σ −1 C 1
is C 1′ A −1 C 1 and leading term in ( BΣB ′) −1 is C 1′ Σ −1 C 1 , thus, 1 ~ χ n2− p , as both It is obvious that
C 1′ A −1 C 1 n1n 2
the matrices are of order 1 × 1 . T2 = D2 ,
n1 + n 2
The result is true for any C 1 even when C 1 is obtained by observing a random vector, which
n1n 2
is distributed independently of A . i.e. two-sample T 2 and D 2 are almost the same, except for the constant k 2 = .
n1 + n 2
u ′ Σ −1 u
Hence, if u and A are independently distributed then, ~ χ n2− p . Let
u ′ A −1 u
Y = k ( x (1) − x (2) ) , then expected value of Y is E (Y ) = k (µ (1) − µ (2) ) = δ
We know that
and the variance covariance matrix of Y is
V1 = n ( x − µ )′ Σ −1 ( x − µ ) ~ χ 2p
0 0
ΣY = k 2 E[( x (1) − x ( 2) ) − (µ (1) − µ ( 2) )][( x (1) − x (2) ) − ( µ (1) − µ ( 2) )]′
Let u = (x − µ )
0
1 1 n + n2 1
= k 2 + Σ = Σ , because 1 = .
( x − µ 0 )′ Σ −1 ( x − µ 0 ) n1 n 2 n1n2 k2
V2 = ~ χ n2− p
( x − µ 0 )′ A −1 ( x − µ 0 ) Therefore,
Y ~ N p (δ , Σ) , then, k 2 D 2 = Y ′ S −1 Y .
(n − 1) ( x − µ 0 )′ Σ −1 ( x − µ 0 ) n
= × ~ χ n2− p , where A = (n − 1) S Since Σ is positive definite matrix there exist a nonsingular matrix C such that
( x − µ 0 ) ′ S −1 ( x − µ 0 ) n
CΣC ′ = I ⇒ CC ′ = Σ −1 .
V1 and V2 are independently distributed as χ 2 Define,
V1 / n V
T 2 = n ( x − µ ) ′ S −1 ( x − µ ) = = (n − 1) 1 Y * = C Y , S * = C S C ′ , and δ * = C δ , then,
0 0 V2 / n (n − 1) V2
′ −1
k 2 D 2 = Y * S * Y * , and the expected value of Y * is
T2 V T2 n− p V1 / p
⇒ = 1 or × = ~ F p, n − p .
n − 1 V2 n −1 n V2 / n − p E (Y * ) = C E (Y ) = Cδ = δ * , and the variance covariance matrix of Y * is
Σ = C E[Y − E (Y )][(Y − E (Y )]′C ′ = C Σ C ′ = I .
Distribution of Mahalanobis's D 2 Y*
Thus,
The quantity (µ (1) − µ (2) )′ Σ −1 ( µ (1) − µ (2) ) is denoted by ∆2 and was proposed by
′ ′
Mahalanobis as a measure of the distance between the two populations, N p ( µ (1) , Σ) , and Y * ~ N p (δ * , I ) , ⇒ Y * Y * ~ χ 2p (δ * δ * ) ,
N p ( µ (2) , Σ) . If the parameters are replaced by their unbiased estimates, is denoted by D 2 , where
which is given by ′
δ * δ * = δ ′ C ′ C δ = δ ′ Σ −1 δ = λ2 .
Let
n1 + n2 − 2
(n1 + n 2 − 2) S = ∑ Z α Z α ′ , where Z α ~ N p (0, Σ)
α =1
n1 + n2 −2
⇒ (n1 + n 2 − 2) S * = ∑ (C Z α ) (C Z α )′ , where C Z α ~ N p (0, I ) .
α =1
Therefore,
χ 2p (λ2 )
k 2 D 2 = Y * S *−1 Y * = (n1 + n 2 − 2)
2
χ n + n −2−( p −1)
1 2
n1n 2 n1 + n 2 − p − 1 2 χ 2p (λ 2 ) / p
⇒ D = ~ F p , n1+ n2 − p −1 (λ 2 ) .
n1 + n 2 (n1 + n 2 − 2) p χ n2 +n −2−( p −1) / n1 + n2 − p − 1
1 2
The problem of discriminant analysis deals with assigning an individual to one of several categories on the basis of measurements on a p − component vector of variables x taken on that individual. For example, we take certain measurements on the skull of an animal and want to know whether it was male or female; a patient is to be classified as diabetic or non-diabetic on the basis of certain tests such as blood, urine, blood pressure, etc.; a salesman is to be classified as successful or unsuccessful on the basis of different psychological tests.

Procedure of classification into one of two populations with known probability distributions

Let R denote the entire p − dimensional space in which the point of observation x falls. We then have to divide R into two parts, say R_1 and R_2 , so that:

if x falls in R_1 , we assign the individual to population π_1 ;
if x falls in R_2 , we assign the individual to population π_2 .

Obviously, with any such procedure an error of misclassification is inevitable (unavoidable), i.e. the rule may assign an individual to π_2 when he really belongs to π_1 , and vice versa. A rule should control this error of discrimination.

Let f_1(x) and f_2(x) be the probability density functions of x in the two populations π_1 and π_2 . Let

q_1 = a priori probability that an individual comes from π_1 ,
q_2 = a priori probability that an individual comes from π_2 ,
Pr(1 | 2) = Pr( an individual belonging to π_2 is misclassified to π_1 ) ,
Pr(2 | 1) = Pr( an individual belonging to π_1 is misclassified to π_2 ) .

Obviously,

Pr(1 | 2) = ∫_{R_1} f_2(x) dx ,  and  Pr(2 | 1) = ∫_{R_2} f_1(x) dx .

Since the probability of drawing an observation from π_1 is q_1 and from π_2 is q_2 , we have

Pr( drawing an observation from π_1 and misclassifying it as from π_2 ) = q_1 Pr(2 | 1) ,
Pr( drawing an observation from π_2 and misclassifying it as from π_1 ) = q_2 Pr(1 | 2) .

Then the total chance of misclassification, say φ , is

φ = q_1 Pr(2 | 1) + q_2 Pr(1 | 2) .      (7.1)

We choose regions R_1 and R_2 such that (7.1) is minimized. The procedure that minimizes (7.1) for given q_1 and q_2 is called a Bayes procedure. Consider

φ = q_1 Pr(2 | 1) + q_2 Pr(1 | 2) = q_1 ∫_{R_2} f_1(x) dx + q_2 ∫_{R_1} f_2(x) dx

= q_1 ∫_{R_2} f_1(x) dx + q_1 ∫_{R_1} f_1(x) dx − q_1 ∫_{R_1} f_1(x) dx + q_2 ∫_{R_1} f_2(x) dx

= q_1 ∫_{R} f_1(x) dx + ∫_{R_1} [ q_2 f_2(x) − q_1 f_1(x) ] dx .

The first term is fixed, so φ is minimized by putting into R_1 exactly those points for which the integrand is non-positive, that is,

R_1 = { x | q_2 f_2(x) ≤ q_1 f_1(x) } = { x | q_1 f_1(x) / ( q_2 f_2(x) ) ≥ 1 } = { x | f_1(x) / f_2(x) ≥ q_2 / q_1 } .

Similarly,

R_2 = { x | q_1 f_1(x) / ( q_2 f_2(x) ) < 1 } = { x | f_1(x) / f_2(x) < q_2 / q_1 } .

Further, if the costs of misclassification are given — C(2 | 1) the cost of misclassification to π_2 when the individual actually comes from π_1 , and C(1 | 2) the cost of misclassification to π_1 when he actually comes from π_2 — the two errors need not be equally serious. For example, if a potentially good candidate for admission to a medical school is rejected, the nation will suffer a shortage of medical persons; on the contrary, if a bad candidate is admitted, he may not be able to complete the course successfully, and the money, resources and equipment used on him will be a waste. The total expected cost of misclassification is

E(C) = C(2 | 1) Pr(2 | 1) q_1 + C(1 | 2) Pr(1 | 2) q_2 ,

and the classification rule will be

R_1 = { x | f_1(x) / f_2(x) ≥ q_2 C(1 | 2) / ( q_1 C(2 | 1) ) } ,  and  R_2 = { x | f_1(x) / f_2(x) < q_2 C(1 | 2) / ( q_1 C(2 | 1) ) } .

Classification into one of two known multivariate normal populations

Let f_1(x) be the density function of N_p(μ^(1), Σ) and f_2(x) the density function of N_p(μ^(2), Σ) . The region R_1 of classification into the first population is given by

R_1 = { x | f_1(x) / f_2(x) ≥ k } ,  where k = q_2 C(1 | 2) / ( q_1 C(2 | 1) ) .

Consider

ln [ f_1(x) / f_2(x) ] ≥ ln k .      (7.2)

The left-hand side of (7.2) can be expanded as

ln f_1(x) − ln f_2(x) = − ½ { ( x − μ^(1) )′ Σ^{−1} ( x − μ^(1) ) − ( x − μ^(2) )′ Σ^{−1} ( x − μ^(2) ) }

= − ½ { ( x′ Σ^{−1} x − x′ Σ^{−1} μ^(1) − μ^(1)′ Σ^{−1} x + μ^(1)′ Σ^{−1} μ^(1) ) − ( x′ Σ^{−1} x − x′ Σ^{−1} μ^(2) − μ^(2)′ Σ^{−1} x + μ^(2)′ Σ^{−1} μ^(2) ) }

= − ½ { − 2 x′ Σ^{−1} μ^(1) + 2 x′ Σ^{−1} μ^(2) + μ^(1)′ Σ^{−1} μ^(1) − μ^(2)′ Σ^{−1} μ^(2) }

= x′ Σ^{−1} ( μ^(1) − μ^(2) ) − ½ ( μ^(1)′ Σ^{−1} μ^(1) − μ^(2)′ Σ^{−1} μ^(2) ) .

Thus,

R_1 = { x | x′ Σ^{−1} ( μ^(1) − μ^(2) ) − ½ ( μ^(1)′ Σ^{−1} μ^(1) − μ^(2)′ Σ^{−1} μ^(2) ) ≥ ln k } .
Discriminant analysis 135 136 RU Khan
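As an illustration of the rule just derived, the following is a minimal Python sketch (not part of the original notes; it assumes NumPy is available, and the function name `classify` and the parameter values are illustrative only):

```python
# Linear discriminant rule for two known N_p(mu, Sigma) populations with common Sigma:
# assign x to population 1 when U(x) >= ln k, with k = q2*C(1|2) / (q1*C(2|1)).
import numpy as np

def classify(x, mu1, mu2, Sigma, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu1 - mu2
    # U(x) = x' Sigma^{-1} (mu1 - mu2) - (1/2) (mu1 + mu2)' Sigma^{-1} (mu1 - mu2)
    u = x @ Sigma_inv @ diff - 0.5 * (mu1 + mu2) @ Sigma_inv @ diff
    ln_k = np.log((q2 * c12) / (q1 * c21))
    return 1 if u >= ln_k else 2

# Illustrative use with made-up parameters:
mu1 = np.array([2.0, 3.0]); mu2 = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(classify(np.array([1.5, 2.5]), mu1, mu2, Sigma))
```

Setting c12 = c21 and q1 = q2 recovers the simple likelihood-ratio rule without costs.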
The regions are computed easily as follows. Consider Σ⁻¹ d = δ, where d = µ(1) − µ(2); δ is then obtained by solving Σδ = d (by the Doolittle method).

Probability of misclassification (two known p-variate normal populations)

With ln k = 0, an observation drawn from π1 is misclassified when it falls in

R2 = {x : x′Σ⁻¹(µ(1) − µ(2)) < (1/2)(µ(1) + µ(2))′Σ⁻¹(µ(1) − µ(2))}.

Put U = x′Σ⁻¹(µ(1) − µ(2)) = x′δ and ∆² = δ′Σδ = (µ(1) − µ(2))′Σ⁻¹(µ(1) − µ(2)). When x comes from π1, U is normally distributed with mean µ(1)′δ and variance ∆². Make the transformation

(u − µ(1)′δ)/∆ = y, ⇒ du = ∆ dy,

which reduces Pr(2|1) = Pr(U < (1/2)(µ(1) + µ(2))′δ | π1) to a standard normal integral, namely Φ(−∆/2).

Sample discriminant function

Suppose that we have a sample x1(1), …, x_{n1}(1) from Np(µ(1), Σ) and a sample x1(2), …, x_{n2}(2) from Np(µ(2), Σ). The unbiased estimate of µ(1) is

x̄(1) = (1/n1) ∑_{α=1}^{n1} x_α(1),

that of µ(2) is

x̄(2) = (1/n2) ∑_{α=1}^{n2} x_α(2),

and that of Σ is S, defined by (n1 + n2 − 2) S = A1 + A2, where A_i = ∑_{α=1}^{n_i} (x_α(i) − x̄(i)) (x_α(i) − x̄(i))′.

Substituting these estimates for the parameters in the function x′δ, Fisher's discriminant function becomes

x′ S⁻¹ (x̄(1) − x̄(2)).
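A minimal computational sketch of this sample discriminant function (assuming NumPy; the name `fisher_discriminant` and the midpoint cut-off for the equal-prior, equal-cost case are illustrative, not from the notes):

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Return (delta_hat, threshold) for Fisher's sample discriminant x' delta_hat.
    X1, X2: arrays with one observation per row, drawn from the two populations."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    A1 = (X1 - xbar1).T @ (X1 - xbar1)          # within-sample SSP matrices
    A2 = (X2 - xbar2).T @ (X2 - xbar2)
    S = (A1 + A2) / (n1 + n2 - 2)               # pooled unbiased estimate of Sigma
    delta_hat = np.linalg.solve(S, xbar1 - xbar2)   # solve S delta = d, as in the text
    threshold = 0.5 * (xbar1 + xbar2) @ delta_hat   # midpoint cut-off (ln k = 0 case)
    return delta_hat, threshold

# Classify a new observation x0: assign to population 1 if x0 @ delta_hat >= threshold.
```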
Consider now the m populations, say π1, …, πm, with a priori probabilities q1, …, qm and density functions f1(x), …, fm(x). We wish to divide the p-dimensional space R, in which the point of observation x falls, into m mutually exclusive and exhaustive regions R1, …, Rm. With C(j|i) denoting the cost of assigning to πj an individual who actually comes from πi, and Pr(j|i) = ∫_{Rj} fi(x) dx, the total expected cost of misclassification is ∑_{i=1}^m qi ∑_{j≠i} C(j|i) Pr(j|i).

We would like to choose the regions R1, …, Rm to make this expected cost minimum. It can be seen that the classification rule comes out to be:

Assign x to πk if

∑_{i=1, i≠k}^m qi fi(x) C(k|i) < ∑_{i=1, i≠j}^m qi fi(x) C(j|i),   j = 1, 2, …, m; j ≠ k.

Let X ~ Np(µ(i), Σ), i = 1, 2, …, m, and let the costs of misclassification be equal. Then the rule is: assign x to πk if

∑_{i=1, i≠k}^m qi fi(x) < ∑_{i=1, i≠j}^m qi fi(x),   j = 1, 2, …, m; j ≠ k,

i.e.

∑_{i=1, i≠k}^m qi fi(x) + qk fk(x) − qk fk(x) < ∑_{i=1, i≠j}^m qi fi(x) + qj fj(x) − qj fj(x),

and since

∑_{i=1, i≠k}^m qi fi(x) + qk fk(x) = ∑_{i=1}^m qi fi(x) = ∑_{i=1, i≠j}^m qi fi(x) + qj fj(x),

⇒ qk fk(x) > qj fj(x)

⇒ fk(x)/fj(x) > qj/qk.

Writing U_jk(x) = ln [fj(x)/fk(x)], there are mC2 such combinations of U_jk(x) for the m categories. For m = 3 the combinations are U12, U13, U23, and the rule becomes:

assign x to π1 if  U12 = ln [f1(x)/f2(x)] > ln (q2/q1)  and  U13 = ln [f1(x)/f3(x)] > ln (q3/q1);
assign x to π2 if  U21 = ln [f2(x)/f1(x)] > ln (q1/q2)  and  U23 = ln [f2(x)/f3(x)] > ln (q3/q2);
assign x to π3 if  U31 = ln [f3(x)/f1(x)] > ln (q1/q3)  and  U32 = ln [f3(x)/f2(x)] > ln (q2/q3).

For the normal populations, U_jk(x) = [x − (1/2)(µ(j) + µ(k))]′ Σ⁻¹ (µ(j) − µ(k)), exactly as in the two-population case.

Sample discriminant function

X ~ Np(µ(i), Σ), with µ(i) and Σ unknown; then replace µ(i) by x̄(i) and Σ by S, where

S = [1 / (∑_{i=1}^m ni − m)] ∑_{i=1}^m ∑_{α=1}^{ni} (x_iα x′_iα − ni x̄(i) x̄(i)′),   or   (∑_{i=1}^m ni − m) S = A1 + A2 + … + Am.

Substituting these estimates for the parameters in the function U_jk, the sample discriminant function becomes

V_jk(x) = [x − (1/2)(x̄(j) + x̄(k))]′ S⁻¹ (x̄(j) − x̄(k)).
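The following is a minimal sketch (assuming NumPy) of the m-group sample rule above: estimate the group means and the pooled S, then assign x to the group with the largest ln qk + x̄(k)′S⁻¹x − (1/2) x̄(k)′S⁻¹x̄(k), which is equivalent to qk fk(x) > qj fj(x) for all j. The function names are illustrative only.

```python
import numpy as np

def fit_pooled(samples):
    """samples: list of (n_i x p) arrays, one per population."""
    means = [X.mean(axis=0) for X in samples]
    A = sum((X - mu).T @ (X - mu) for X, mu in zip(samples, means))
    S = A / (sum(len(X) for X in samples) - len(samples))   # (sum n_i - m) S = A_1 + ... + A_m
    return means, S

def assign(x, means, S, priors):
    S_inv = np.linalg.inv(S)
    scores = [np.log(q) + mu @ S_inv @ x - 0.5 * mu @ S_inv @ mu
              for mu, q in zip(means, priors)]
    return int(np.argmax(scores)) + 1   # population label 1..m
```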
Tests associated with discriminant functions

1) Goodness of fit of a hypothetical discriminant function

H0: a given function x′δ is good enough for discriminating between the two populations. We use the test statistic

[(n1 + n2 − p − 1)/(p − 1)] · [k² (Dp² − D1²)] / [(n1 + n2 − 2) + k² D1²] ~ F_{p−1, n1+n2−p−1},

where k² = n1 n2 / (n1 + n2), Dp² = (x̄(1) − x̄(2))′ S⁻¹ (x̄(1) − x̄(2)) is the studentized D² of Mahalanobis based on all p characters x, and

D1² = (δ′ d)² / (δ′ S δ),

where d = x̄(1) − x̄(2) and δ is the coefficient vector of the hypothetical function (when δ is taken as S⁻¹(x̄(1) − x̄(2)), the sample discriminant coefficients, D1² coincides with Dp²).
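A minimal sketch (assuming NumPy and SciPy) of the goodness-of-fit statistic above, in the form reconstructed here, for a hypothesised coefficient vector delta; the function name is illustrative only:

```python
import numpy as np
from scipy.stats import f as f_dist

def goodness_of_fit(X1, X2, delta):
    n1, n2 = len(X1), len(X2)
    p = X1.shape[1]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    d = xbar1 - xbar2
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / (n1 + n2 - 2)
    k2 = n1 * n2 / (n1 + n2)
    D2_p = d @ np.linalg.solve(S, d)                 # Mahalanobis D^2 on all p characters
    D2_1 = (delta @ d) ** 2 / (delta @ S @ delta)    # D^2 along the hypothetical function
    F = ((n1 + n2 - p - 1) / (p - 1)) * (k2 * (D2_p - D2_1)) / ((n1 + n2 - 2) + k2 * D2_1)
    pval = f_dist.sf(F, p - 1, n1 + n2 - p - 1)
    return F, pval
```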
2) Test for additional information

H0: ∆p² = ∆q² (i.e. the remaining p − q components do not provide additional information).

Mean and variance of the discriminant function. Given x ~ Np(µ(1), Σ) and

U = x′Σ⁻¹(µ(1) − µ(2)) − (1/2)(µ(1) + µ(2))′Σ⁻¹(µ(1) − µ(2)),

we have

E(U) = µ(1)′Σ⁻¹(µ(1) − µ(2)) − (1/2)(µ(1) + µ(2))′Σ⁻¹(µ(1) − µ(2)) = (1/2)(µ(1) − µ(2))′Σ⁻¹(µ(1) − µ(2)), and

V(U) = E(U − EU)(U − EU)′
 = E{[x′ − (1/2)(µ(1) + µ(2))′ − (1/2)(µ(1) − µ(2))′] Σ⁻¹ (µ(1) − µ(2))} {[x′ − (1/2)(µ(1) + µ(2))′ − (1/2)(µ(1) − µ(2))′] Σ⁻¹ (µ(1) − µ(2))}′
 = E[(x − µ(1))′ Σ⁻¹ (µ(1) − µ(2))] [(x − µ(1))′ Σ⁻¹ (µ(1) − µ(2))]′    [since (1/2)(µ(1) + µ(2)) + (1/2)(µ(1) − µ(2)) = µ(1)]
 = (µ(1) − µ(2))′ Σ⁻¹ E(x − µ(1))(x − µ(1))′ Σ⁻¹ (µ(1) − µ(2))
 = (µ(1) − µ(2))′ Σ⁻¹ (µ(1) − µ(2)) = α (say).

Thus E(U) = α/2 and V(U) = α when x comes from π1.
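A quick numerical check (not part of the notes) of E(U) = α/2, V(U) = α, and of the misclassification probability Φ(−√α/2) obtained earlier; it assumes NumPy and SciPy, and the parameter values are arbitrary illustrations:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu1, mu2 = np.array([2.0, 1.0]), np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

delta = np.linalg.solve(Sigma, mu1 - mu2)            # solve Sigma delta = d
alpha = (mu1 - mu2) @ delta                          # alpha = d' Sigma^{-1} d
X = rng.multivariate_normal(mu1, Sigma, size=200_000)
U = X @ delta - 0.5 * (mu1 + mu2) @ delta
print(U.mean(), alpha / 2)                           # sample mean of U vs alpha/2
print(U.var(), alpha)                                # sample variance of U vs alpha
print((U < 0).mean(), norm.cdf(-np.sqrt(alpha) / 2)) # Pr(2|1) vs Phi(-Delta/2)
```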
PRINCIPAL COMPONENTS

In multivariate analysis the dimension of X causes problems in obtaining suitable statistical analyses of a set of observations (data) on X. It is natural to look for methods of rearranging the data so that, with as little loss of information as possible, the dimension of the problem is considerably reduced. This reduction is possible by transforming the original variables to a new set of variables which are uncorrelated. These new variables are known as principal components.

Principal components are normalized linear combinations of the original variables which have specified properties in terms of variance. They are uncorrelated and are ordered, so that the first component displays the largest amount of variation, the second component displays the second largest amount of variation, and so on.

If there are p variables, then p components are required to reproduce the total variability present in the data, but often most of this variability can be accounted for by a small number k (k < p) of the components. If this is so, there is almost as much information in the k components as in the original p variables, and the k components can then replace the original p variables. That is why this is considered a linear reduction technique. The technique produces the best results when the original variables are highly correlated, positively or negatively.

For example, suppose we are interested in finding the level of performance in mathematics of the tenth grade students of a certain school. We may then record their scores in mathematics, i.e., we consider just one characteristic of each student. Now suppose we are instead interested in overall performance and select some p characteristics, such as mathematics, English, history, science, etc. These characteristics, although related to each other, may not all contain the same amount of information, and in fact some characteristics may be completely redundant; analyzing all of them wastes resources without adding information. Thus, we should select only those characteristics that truly discriminate one student from another, while those least discriminatory should be discarded.

Determination

Let X be a p-component vector with covariance matrix Σ, which is positive definite (all the characteristic roots are positive and distinct). Since we are interested only in the variances and covariances, we shall assume that E(X) = 0. Let β be a p-component column vector such that β′β = 1 (in order to obtain a unique solution). Define

U = β′X; then E(U) = E(β′X) = 0, and V(β′X) = β′ E(XX′) β = β′Σβ.   (8.1)

The normalized linear combination β′X with maximum variance is therefore obtained by maximizing (8.1) subject to the condition β′β = 1. We shall use the technique of Lagrange multipliers and maximize

φ = β′Σβ − λ(β′β − 1) = ∑_{i,j} βi σij βj − λ(∑_i βi² − 1),

where λ is a Lagrange multiplier. The vector of partial derivatives is

∂φ/∂β = 2Σβ − 2λβ;

equating this to zero gives

Σβ − λβ = 0,
⇒ (Σ − λI) β = 0.   (8.2)

Equation (8.2) admits a non-zero solution in β only if

|Σ − λI| = 0,   (8.3)

i.e. (Σ − λI) is a singular matrix. The function |Σ − λI| is a polynomial in λ of degree p (i.e. λ^p + a1 λ^{p−1} + … + a_{p−1} λ + a_p). Therefore, equation (8.3) will give p solutions in λ; let these be λ1 ≥ λ2 ≥ … ≥ λp.

Pre-multiplying (8.2) by β′ gives

β′Σβ − λ β′β = 0
⇒ β′Σβ = λ.   (8.4)

This shows that if β satisfies equation (8.2) and β′β = 1, then the variance of β′X is λ. Thus, for the maximum variance we should use in (8.2) the largest root λ1. Let β(1) be a normalized solution of

(Σ − λ1 I) β = 0, i.e. Σβ(1) = λ1 β(1); then

U1 = β(1)′ X

is a normalized linear combination (the first principal component) with maximum variance, i.e.

V(U1) = β(1)′ Σ β(1) = λ1 [from equation (8.4)].

We next have to find a normalized linear combination β′X that has maximum variance among all linear combinations uncorrelated with U1. Non-correlation with U1 is the same as

0 = Cov(β′X, U1) = E(β′X − E β′X)(β(1)′X − E β(1)′X)′ = E(β′X X′β(1)) = β′Σβ(1).

Since Σβ(1) = λ1 β(1),

β′Σβ(1) = λ1 β′β(1) = 0.   (8.5)
We therefore maximize

φ2 = β′Σβ − λ(β′β − 1) − 2υ1 β′Σβ(1),

where λ and υ1 are Lagrange multipliers. The vector of partial derivatives is

∂φ2/∂β = 2Σβ − 2λβ − 2υ1 Σβ(1), and we set this to 0,

⇒ Σβ − λβ − υ1 Σβ(1) = 0.

Premultiplying by β(1)′ (and, at the general step with constraints β′Σβ(j) = 0, by β(j)′), we obtain

β(j)′Σβ − λ β(j)′β − υj β(j)′Σβ(j) = 0; using equation (8.5),

⇒ −υj β(j)′Σβ(j) = 0,

so that υj = 0 and β again satisfies (Σ − λI)β = 0; the next largest root then gives the next principal component, and so on.

Thus, given a positive definite matrix Σ, there exists an orthogonal matrix Β such that Β′ΣΒ is in diagonal form and the diagonal elements of Β′ΣΒ are the characteristic roots of Σ. Therefore, we proceed as follows:

Solve |Σ − λI| = 0. Let the roots be λ1 > λ2 > … > λp; then solve

(Σ − λ1 I) β = 0 and get a normalized solution β(1), i.e. Σβ(1) = λ1 β(1);

⋮

in general, the r-th principal component is Ur = β(r)′ X.

Stop at m (≤ p) for which λm is negligible as compared to λ1, …, λm−1. Thus we get fewer linear combinations than, or at most as many as, the original p variables. The proportion of the total variation accounted for is measured by

contribution of the first principal component = λ1 / ∑_{i=1}^p λi,
contribution of the first two (I + II) principal components = (λ1 + λ2) / ∑_{i=1}^p λi,
contribution of the first three (I + II + III) principal components = (λ1 + λ2 + λ3) / ∑_{i=1}^p λi.
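A minimal sketch (assuming NumPy) of the procedure above: eigen-decompose Σ (or its sample estimate), take the roots in decreasing order, and report the contributions. The function name is illustrative only.

```python
import numpy as np

def principal_components(Sigma):
    """Return eigenvalues (descending), eigenvectors (columns), cumulative contributions."""
    lam, B = np.linalg.eigh(Sigma)              # eigh: Sigma is symmetric positive definite
    order = np.argsort(lam)[::-1]               # largest root first
    lam, B = lam[order], B[:, order]
    contribution = np.cumsum(lam) / lam.sum()   # lambda1/sum, (lambda1+lambda2)/sum, ...
    return lam, B, contribution

# The r-th principal component of an observation x is then U_r = B[:, r-1] @ x.
```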
Theorem: An orthogonal transformation V = CX of a random vector X leaves invariant the generalized variance and the sum of the variances of the components.

The sum of the variances of the components of V is

∑_{i=1}^p E(Vi)² = tr(CΣC′) = tr(ΣC′C) = tr(ΣI) = tr Σ = ∑_{i=1}^p E(Xi)².

Corollary: The generalized variance of the vector of principal components is the generalized variance of the original vector. But

tr Σ_U = V(u1) + … + V(up), and tr Σ_X = V(x1) + … + V(xp),

so that λ1 + … + λp = V(x1) + … + V(xp).

Exercise: Let

R = ( 1    r12 )
    ( r12   1  ).

Solve |R − λI| = 0, i.e.

| 1 − λ   r12  |
| r12    1 − λ | = 0,  or  (1 − λ)² − r12² = 0,

or, writing r12 = r,

λ² − 2λ + 1 − r² = 0,

or

λ = [2 ± √(4 − 4(1 − r²))] / 2 = 1 ± r.

If r > 0, λ1 = 1 + r, λ2 = 1 − r;
if r < 0, λ1 = 1 − r, λ2 = 1 + r;
if r = 0, λ1 = 1 = λ2.

Thus, in the case of perfect correlation a single principal component explains the total variability, while in the case of zero correlation no reduction is achieved by principal components.
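A quick numerical check of the 1 ± r result above (assuming NumPy; the value r = 0.6 is an arbitrary illustration):

```python
import numpy as np

r = 0.6
R = np.array([[1.0, r], [r, 1.0]])
print(np.linalg.eigvalsh(R))   # ascending order: [1 - r, 1 + r] = [0.4, 1.6]
```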
Exercise: Find the variance of the first principal component of the covariance matrix Σ defined by