The Multivariate Normal Distribution
• While real data are never exactly multivariate normal, the normal density
is often a useful approximation to the “true” population distribution because
of a central limit effect.
• One advantage of the multivariate normal distribution stems from the fact
that it is mathematically tractable and “nice” results can be obtained.
To summarize, many real-world problems fall naturally within the framework
of normal theory. The importance of the normal distribution rests on its dual
role as both population model for certain natural phenomena and approximate
sampling distribution for many statistics.
3.2 The Multivariate Normal Density and Its Properties
• Recall that the univariate normal distribution, with mean $\mu$ and variance $\sigma^2$, has the probability density function
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \qquad -\infty < x < \infty $$
• The term
$$ \left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu) $$
measures the squared distance from $x$ to $\mu$ in standard deviation units; the multivariate density generalizes it to $(x-\mu)'\Sigma^{-1}(x-\mu)$.
• A $p$-dimensional normal density for the random vector $X' = [X_1, X_2, \ldots, X_p]$ has the form
$$ f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x-\mu)'\Sigma^{-1}(x-\mu)/2}, \qquad -\infty < x_i < \infty,\ i = 1, 2, \ldots, p. $$
We denote this $p$-dimensional normal density by $N_p(\mu, \Sigma)$.
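As a sanity check on this formula, the sketch below (all numerical values are assumed, purely for illustration) evaluates the density directly and compares it with scipy's built-in implementation:

```python
# Minimal sketch: evaluate the p-dimensional normal density explicitly and
# compare with scipy.stats.multivariate_normal (assumed mu, Sigma, x).
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu = np.array([1.0, -1.0])                      # assumed mean vector
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])      # assumed covariance matrix
x = np.array([0.5, 0.0])                        # point at which to evaluate

# (2*pi)^(-p/2) |Sigma|^(-1/2) exp(-(x-mu)' Sigma^{-1} (x-mu) / 2)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)' Sigma^{-1} (x-mu)
dens = np.exp(-quad / 2) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

print(dens)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```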
Example 3.1 (Bivariate normal density) Let us evaluate the $p = 2$ variate normal density in terms of the individual parameters $\mu_1 = E(X_1)$, $\mu_2 = E(X_2)$, $\sigma_{11} = \operatorname{Var}(X_1)$, $\sigma_{22} = \operatorname{Var}(X_2)$, and $\rho_{12} = \sigma_{12}/(\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}) = \operatorname{Corr}(X_1, X_2)$.
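Substituting $\sigma_{12} = \rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}$ and inverting the $2\times2$ covariance matrix gives the familiar bivariate form:
$$ f(x_1, x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}} \exp\left\{ -\frac{1}{2(1-\rho_{12}^2)} \left[ \left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right) \right] \right\} $$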
$$ \Sigma e = \lambda e \quad\text{implies}\quad \Sigma^{-1} e = \frac{1}{\lambda}\, e $$
Constant probability density contour:
$$ \{\text{all } x \text{ such that } (x-\mu)'\Sigma^{-1}(x-\mu) = c^2\} = \text{surface of an ellipsoid centered at } \mu $$
These ellipsoids are centered at $\mu$ and have axes $\pm c\sqrt{\lambda_i}\, e_i$, where $\Sigma e_i = \lambda_i e_i$ for $i = 1, 2, \ldots, p$.
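A minimal sketch, with an assumed bivariate covariance matrix, computing the half-axes $\pm c\sqrt{\lambda_i}\, e_i$ of such a contour from the eigendecomposition:

```python
# Sketch: axes of the constant-density ellipse (x-mu)' Sigma^{-1} (x-mu) = c^2.
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed covariance matrix
c = 1.0                                      # assumed contour level
lam, E = np.linalg.eigh(Sigma)               # eigenvalues lam[i], eigenvectors E[:, i]
for i in range(len(lam)):
    half_axis = c * np.sqrt(lam[i]) * E[:, i]
    print(f"axis {i + 1}: +/- {half_axis}")
```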
Example 3.2 (Contours of the bivariate normal density) Obtain the axes of constant probability density contours for a bivariate normal distribution when $\sigma_{11} = \sigma_{22}$.
The solid ellipsoid of $x$ values satisfying
$$ (x-\mu)'\Sigma^{-1}(x-\mu) \le \chi_p^2(\alpha) $$
has probability $1-\alpha$, where $\chi_p^2(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with $p$ degrees of freedom.
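A simulation sketch (assumed $\mu$, $\Sigma$, and $\alpha$) confirming that the ellipsoid captures about $1-\alpha$ of the probability:

```python
# Sketch: estimate P[(x-mu)' Sigma^{-1} (x-mu) <= chi2_p(alpha)] by simulation.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p, alpha = 2, 0.05
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed covariance matrix
cutoff = chi2.ppf(1 - alpha, df=p)           # upper (100*alpha)th percentile

X = rng.multivariate_normal(mu, Sigma, size=100_000)
diff = X - mu
d2 = np.einsum('ij,ij->i', diff, np.linalg.solve(Sigma, diff.T).T)
print(np.mean(d2 <= cutoff))                 # approximately 1 - alpha = 0.95
```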
Additional Properties of the Multivariate Normal Distribution
The following are true for a random vector X having a multivariate normal distribution:
Result 3.2 If $X$ is distributed as $N_p(\mu, \Sigma)$, then any linear combination of variables $a'X = a_1X_1 + a_2X_2 + \cdots + a_pX_p$ is distributed as $N(a'\mu, a'\Sigma a)$. Also, if $a'X$ is distributed as $N(a'\mu, a'\Sigma a)$ for every $a$, then $X$ must be $N_p(\mu, \Sigma)$.
Result 3.3 If $X$ is distributed as $N_p(\mu, \Sigma)$, the $q$ linear combinations
$$ A_{(q\times p)} X_{(p\times 1)} = \begin{bmatrix} a_{11}X_1 + \cdots + a_{1p}X_p \\ a_{21}X_1 + \cdots + a_{2p}X_p \\ \vdots \\ a_{q1}X_1 + \cdots + a_{qp}X_p \end{bmatrix} $$
are distributed as $N_q(A\mu, A\Sigma A')$. Also $X_{(p\times 1)} + d_{(p\times 1)}$, where $d$ is a vector of constants, is distributed as $N_p(\mu + d, \Sigma)$.
Example 3.4 (The distribution of two linear combinations of the components of a normal random vector) For $X$ distributed as $N_3(\mu, \Sigma)$, find the distribution of
$$ \begin{bmatrix} X_1 - X_2 \\ X_2 - X_3 \end{bmatrix} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = AX $$
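By Result 3.3 the answer is $N_2(A\mu, A\Sigma A')$. A minimal numerical sketch, with assumed values for $\mu$ and $\Sigma$ (the example itself leaves them general):

```python
# Sketch for Example 3.4: compute A mu and A Sigma A' for assumed parameters.
import numpy as np

A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
mu = np.array([2.0, 1.0, 0.0])               # assumed mean vector
Sigma = np.array([[3.0, 1.0, 1.0],
                  [1.0, 2.0, 0.0],
                  [1.0, 0.0, 1.0]])          # assumed positive definite covariance

print(A @ mu)            # mean vector of AX
print(A @ Sigma @ A.T)   # covariance matrix of AX
```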
Result 3.4 All subsets of $X$ are normally distributed. If we respectively partition $X$, its mean vector $\mu$, and its covariance matrix $\Sigma$ as
$$ X_{(p\times1)} = \begin{bmatrix} X_1\ (q\times1) \\ X_2\ ((p-q)\times1) \end{bmatrix}, \qquad \mu_{(p\times1)} = \begin{bmatrix} \mu_1\ (q\times1) \\ \mu_2\ ((p-q)\times1) \end{bmatrix} $$
and
$$ \Sigma_{(p\times p)} = \begin{bmatrix} \Sigma_{11}\ (q\times q) & \Sigma_{12}\ (q\times(p-q)) \\ \Sigma_{21}\ ((p-q)\times q) & \Sigma_{22}\ ((p-q)\times(p-q)) \end{bmatrix} $$
then $X_1$ is distributed as $N_q(\mu_1, \Sigma_{11})$.
Result 3.5
(c) If $X_1$ and $X_2$ are independent and are distributed as $N_{q_1}(\mu_1, \Sigma_{11})$ and $N_{q_2}(\mu_2, \Sigma_{22})$, respectively, then $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ has the multivariate normal distribution
$$ N_{q_1+q_2}\!\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \right) $$
Example 3.6 (The equivalence of zero covariance and independence for normal variables) Let $X_{(3\times1)}$ be $N_3(\mu, \Sigma)$ with
$$ \Sigma = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix} $$
Which variables are independent? Since $\sigma_{13} = \sigma_{23} = 0$, the subvector $(X_1, X_2)'$ is independent of $X_3$; but $X_1$ and $X_2$ are not independent, since $\sigma_{12} = 1 \ne 0$.
Result 3.8 Let $X_1, X_2, \ldots, X_n$ be mutually independent with $X_j$ distributed as $N_p(\mu_j, \Sigma)$. (Note that each $X_j$ has the same covariance matrix $\Sigma$.) Then
$$ V_1 = c_1X_1 + c_2X_2 + \cdots + c_nX_n $$
is distributed as $N_p\big(\sum_{j=1}^n c_j\mu_j,\ \big(\sum_{j=1}^n c_j^2\big)\Sigma\big)$. Moreover, $V_1$ and $V_2 = b_1X_1 + b_2X_2 + \cdots + b_nX_n$ are jointly multivariate normal. Consequently, $V_1$ and $V_2$ are independent if $b'c = \sum_{j=1}^n c_jb_j = 0$.
Example 3.8 (Linear combinations of random vectors) Let $X_1, X_2, X_3$ and $X_4$ be independent and identically distributed $3\times1$ random vectors with
$$ \mu = \begin{bmatrix} 3 \\ -1 \\ 1 \end{bmatrix} \quad\text{and}\quad \Sigma = \begin{bmatrix} 3 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 2 \end{bmatrix} $$
(a) Find the mean and variance of the linear combination $a'X_1$ of the three components of $X_1$, where $a = [a_1\ a_2\ a_3]'$.
(b) Consider the two linear combinations of random vectors
$$ \tfrac{1}{2}X_1 + \tfrac{1}{2}X_2 + \tfrac{1}{2}X_3 + \tfrac{1}{2}X_4 $$
and
$$ X_1 + X_2 + X_3 - 3X_4. $$
Find the mean vector and covariance matrix for each linear combination of vectors, and also the covariance between them.
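A numerical sketch of part (b) using Result 3.8 with $c = (\tfrac12, \tfrac12, \tfrac12, \tfrac12)$ and $b = (1, 1, 1, -3)$; the cross-covariance $(b'c)\Sigma$ is the quantity whose vanishing gives independence in Result 3.8:

```python
# Sketch for Example 3.8(b): means and covariances via Result 3.8.
import numpy as np

mu = np.array([3.0, -1.0, 1.0])
Sigma = np.array([[3.0, -1.0, 1.0],
                  [-1.0, 1.0, 0.0],
                  [1.0, 0.0, 2.0]])
c = np.array([0.5, 0.5, 0.5, 0.5])
b = np.array([1.0, 1.0, 1.0, -3.0])

print(c.sum() * mu, (c @ c) * Sigma)   # mean 2*mu and covariance 1*Sigma of V1
print(b.sum() * mu, (b @ b) * Sigma)   # mean 0 and covariance 12*Sigma of V2
print((b @ c) * Sigma)                 # cross-covariance (b'c) Sigma -> zero matrix
```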
3.3 Sampling from a Multivariate Normal Distribution and
Maximum Likelihood Estimation
The Multivariate Normal Likelihood
• Likelihood
When the numerical values of the observations become available, they may be substituted for the $x_j$ in the joint density of the sample. The resulting expression, now considered as a function of $\mu$ and $\Sigma$ for the fixed set of observations $x_1, x_2, \ldots, x_n$, is called the likelihood.
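For reference, the likelihood is the joint density of the $n$ independent observations, viewed as a function of $\mu$ and $\Sigma$:
$$ L(\mu, \Sigma) = \prod_{j=1}^{n} \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x_j-\mu)'\Sigma^{-1}(x_j-\mu)/2} = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\, e^{-\sum_{j=1}^n (x_j-\mu)'\Sigma^{-1}(x_j-\mu)/2} $$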
(b) $\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$, where the $\lambda_i$ are the eigenvalues of $A$.
Maximum Likelihood Estimate of µ and Σ
Result 3.10 Given a $p\times p$ symmetric positive definite matrix $B$ and a scalar $b > 0$, it follows that
$$ \frac{1}{|\Sigma|^b}\, e^{-\operatorname{tr}(\Sigma^{-1}B)/2} \le \frac{1}{|B|^b}\, (2b)^{pb}\, e^{-bp} $$
for all positive definite $\Sigma_{(p\times p)}$, with equality holding only for $\Sigma = \frac{1}{2b}B$.
Applying Result 3.10 to the likelihood yields the maximum likelihood estimators
$$ \hat\mu = \bar X \quad\text{and}\quad \hat\Sigma = \frac{1}{n}\sum_{j=1}^{n} (X_j - \bar X)(X_j - \bar X)' = \frac{n-1}{n}\, S $$
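A minimal sketch, using simulated data under assumed parameters, verifying the relation $\hat\Sigma = \frac{n-1}{n}S$ numerically:

```python
# Sketch: MLEs from a sample matrix X (n rows, p columns).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)   # assumed sample

mu_hat = X.mean(axis=0)                        # mu-hat = x-bar
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n  # divides by n, not n - 1
S = np.cov(X, rowvar=False)                    # sample covariance (divides by n - 1)
print(np.allclose(Sigma_hat, (n - 1) / n * S)) # Sigma-hat = ((n-1)/n) S -> True
```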
For example:
2. The maximum likelihood estimator of $\sqrt{\sigma_{ii}}$ is $\sqrt{\hat\sigma_{ii}}$ (by the invariance property of maximum likelihood estimators), where
$$ \hat\sigma_{ii} = \frac{1}{n}\sum_{j=1}^{n} (X_{ij} - \bar X_i)^2 $$
$$ \bar X \quad\text{and}\quad S = \frac{1}{n-1}\sum_{j=1}^{n} (X_j - \bar X)(X_j - \bar X)' $$
are sufficient statistics.
• Since many multivariate techniques begin with sample means and covariances,
it is prudent to check on the adequacy of the multivariate normal assumption.
– For the sample mean, recall that $\bar X$ is normal with mean $\mu$ and variance
$$ \frac{1}{n}\sigma^2 = \frac{\text{population variance}}{\text{sample size}} $$
– For the sample variance, recall that $(n-1)s^2 = \sum_{j=1}^{n}(X_j - \bar X)^2$ is distributed as $\sigma^2$ times a chi-square variable having $n-1$ degrees of freedom (d.f.).
– The chi-square is the distribution of a sum of squares of independent standard normal random variables. That is, $(n-1)s^2$ is distributed as $\sigma^2(Z_1^2 + \cdots + Z_{n-1}^2) = (\sigma Z_1)^2 + \cdots + (\sigma Z_{n-1})^2$. The individual terms $\sigma Z_i$ are independently distributed as $N(0, \sigma^2)$.
• Wishart distribution
The Wishart distribution is the multivariate analog of the chi-square distribution: with $m$ degrees of freedom, it is the distribution of $\sum_{j=1}^{m} Z_j Z_j'$, where the $Z_j$ are independent $N_p(0, \Sigma)$ random vectors.
• The Sampling Distribution of X̄ and S
Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $p$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Then
1. $\bar X$ is distributed as $N_p(\mu, \frac{1}{n}\Sigma)$.
2. $(n-1)S$ is distributed as a Wishart random matrix with $n-1$ d.f.
3. $\bar X$ and $S$ are independent.
3.5 Large-Sample Behavior of X̄ and S
Result 3.12 (Law of large numbers) Let $Y_1, Y_2, \ldots, Y_n$ be independent observations from a population with mean $E(Y_i) = \mu$. Then
$$ \bar Y = \frac{Y_1 + Y_2 + \cdots + Y_n}{n} $$
converges in probability to $\mu$ as $n$ increases without bound.
Large-Sample Behavior of X̄ and S
By the central limit theorem, $\sqrt{n}(\bar X - \mu)$ is approximately $N_p(0, \Sigma)$, and
$$ n(\bar X - \mu)'S^{-1}(\bar X - \mu) \ \text{is approximately}\ \chi_p^2 $$
for $n - p$ large.
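A simulation sketch, drawing from an assumed non-normal (exponential) population, suggesting how the statistic approaches its chi-square limit:

```python
# Sketch: simulate n (xbar - mu)' S^{-1} (xbar - mu) from a non-normal
# population and compare its quantiles with chi-square(p).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p, reps = 100, 3, 2000
mu = np.ones(p)                               # mean of Exponential(1) components
stats = []
for _ in range(reps):
    X = rng.exponential(1.0, size=(n, p))     # assumed non-normal population
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d = xbar - mu
    stats.append(n * d @ np.linalg.solve(S, d))

print(np.quantile(stats, 0.95), chi2.ppf(0.95, df=p))   # should be close
```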
3.6 Assessing the Assumption of Normality
• In situations where the sample size is large and the techniques depend solely on the behavior of $\bar X$, or on distances involving $\bar X$ of the form $n(\bar X - \mu)'S^{-1}(\bar X - \mu)$, the assumption of normality for the individual observations is less crucial.
Therefore, we address these questions:
1. Do the marginal distributions of the elements of $X$ appear to be normal?
2. Do scatter plots of pairs of observations appear to be elliptical?
3. Are there any “wild” observations that should be checked for accuracy?
Evaluating the Normality of the Univariate Marginal Distributions
• Dot diagrams for smaller $n$ and histograms for $n > 25$ or so help reveal situations where one tail of a univariate distribution is much longer than the other.
• Let $\hat p_{i1}$ denote the proportion of observations of the $i$th variable lying within one standard deviation of the mean, i.e., in $(\bar x_i - \sqrt{s_{ii}},\ \bar x_i + \sqrt{s_{ii}})$, and $\hat p_{i2}$ the proportion within two standard deviations. Under normality these should be near 0.683 and 0.954, so either
$$ |\hat p_{i1} - 0.683| > 3\sqrt{\frac{(0.683)(0.317)}{n}} = \frac{1.396}{\sqrt{n}} $$
or
$$ |\hat p_{i2} - 0.954| > 3\sqrt{\frac{(0.954)(0.046)}{n}} = \frac{0.628}{\sqrt{n}} $$
would indicate departures from an assumed normal distribution for the $i$th characteristic.
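Both checks in code, for an assumed univariate sample:

```python
# Sketch: one- and two-standard-deviation proportion checks for normality.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=60)                       # assumed sample for variable i
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)

p1 = np.mean(np.abs(x - xbar) <= s)           # p-hat_i1
p2 = np.mean(np.abs(x - xbar) <= 2 * s)       # p-hat_i2
flag1 = abs(p1 - 0.683) > 1.396 / np.sqrt(n)
flag2 = abs(p2 - 0.954) > 0.628 / np.sqrt(n)
print(p1, p2, flag1, flag2)                   # True flags suggest non-normality
```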
• Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality.
Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ represent the observations $x_1, x_2, \ldots, x_n$ after they are ordered according to magnitude. For a standard normal distribution, the quantiles $q_{(j)}$ are defined by the relation
$$ P[Z \le q_{(j)}] = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = p_{(j)} = \frac{j - \frac{1}{2}}{n} $$
Here $p_{(j)}$ is the probability of getting a value less than or equal to $q_{(j)}$ in a single drawing from a standard normal population.
• The idea is to look at the pairs of quantiles $(q_{(j)}, x_{(j)})$ with the same associated cumulative probability $(j - \frac{1}{2})/n$. If the data arise from a normal population, the pairs $(q_{(j)}, x_{(j)})$ will be approximately linearly related, since $\sigma q_{(j)} + \mu$ is nearly the expected sample quantile.
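A minimal sketch of the construction, using an assumed (hypothetical) ordered sample:

```python
# Sketch: construct Q-Q plot pairs (q_(j), x_(j)) at probability levels (j - 1/2)/n.
import numpy as np
from scipy.stats import norm

x = np.array([-1.00, -0.10, 0.16, 0.41, 0.62,
              0.80, 1.26, 1.54, 1.71, 2.30])   # assumed sample, n = 10
n = len(x)
x_sorted = np.sort(x)
probs = (np.arange(1, n + 1) - 0.5) / n
q = norm.ppf(probs)                            # standard normal quantiles q_(j)
for qj, xj in zip(q, x_sorted):
    print(f"{qj:7.3f}  {xj:7.2f}")             # plot these pairs; look for a line
```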
Example 3.9 (Constructing a Q-Q plot) A sample of $n = 10$ observations gives the values in the following table:
Example 3.10 (A Q-Q plot for radiation data) The quality-control department of a manufacturer of microwave ovens is required by the federal government to monitor the amount of radiation emitted when the doors of the ovens are closed. Observations of the radiation emitted through closed doors of $n = 42$ randomly selected ovens were made. The data are listed in the following table.
The straightness of the Q-Q plot can be measured by calculating the correlation coefficient of the points in the plot. The correlation coefficient for the Q-Q plot is defined by
$$ r_Q = \frac{\sum_{j=1}^{n} (x_{(j)} - \bar x)(q_{(j)} - \bar q)}{\sqrt{\sum_{j=1}^{n} (x_{(j)} - \bar x)^2}\ \sqrt{\sum_{j=1}^{n} (q_{(j)} - \bar q)^2}} $$
and a powerful test of normality can be based on it. Formally, we reject the hypothesis of normality at level of significance $\alpha$ if $r_Q$ falls below the appropriate value in the following table.
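A small sketch computing $r_Q$ for an assumed sample (the critical values themselves must come from the table):

```python
# Sketch: the Q-Q plot correlation coefficient r_Q.
import numpy as np
from scipy.stats import norm

def r_Q(x):
    """Correlation between ordered data x_(j) and normal quantiles q_(j)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return np.corrcoef(x, q)[0, 1]

sample = np.array([-1.00, -0.10, 0.16, 0.41, 0.62,
                   0.80, 1.26, 1.54, 1.71, 2.30])   # assumed sample
print(r_Q(sample))   # compare with the tabled critical value for n and alpha
```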
Example 3.11 (A correlation coefficient test for normality) Let us calculate the correlation coefficient $r_Q$ from the Q-Q plot of Example 3.9 and test for normality.
Linear combinations of more than one characteristic can be investigated. Many statisticians suggest plotting $\hat e_1' x_j$, in which $\hat e_1$ is the eigenvector of $S$ corresponding to its largest eigenvalue $\hat\lambda_1$. Here $x_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$ is the $j$th observation on the $p$ variables $X_1, X_2, \ldots, X_p$. The linear combination $\hat e_p' x_j$ corresponding to the smallest eigenvalue is also frequently singled out for inspection.
Evaluating Bivariate Normality
• For a bivariate normal distribution, the set of $x$ values satisfying
$$ (x-\mu)'\Sigma^{-1}(x-\mu) \le \chi_2^2(0.5) $$
has probability 0.5.
• Thus we should expect roughly the same percentage, 50%, of sample observations to lie in the ellipse given by
$$ \{\text{all } x \text{ such that } (x - \bar x)'S^{-1}(x - \bar x) \le \chi_2^2(0.5)\} $$
where $\mu$ is replaced by $\bar x$ and $\Sigma^{-1}$ by its estimate $S^{-1}$. If not, the normality assumption is suspect.
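A simulation sketch of this check, with assumed bivariate data:

```python
# Sketch: proportion of points whose squared generalized distance falls
# inside the 50% chi-square contour (should be roughly 0.5 under normality).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 2.0]], size=50)  # assumed

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
diff = X - xbar
d2 = np.einsum('ij,ij->i', diff, np.linalg.solve(S, diff.T).T)
print(np.mean(d2 <= chi2.ppf(0.5, df=2)))   # compare with 0.5
```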
Example 3.13 (Constructing a chi-square plot) Let us construct a chi-square plot of the generalized distances given in Example 3.12. The ordered distances and the corresponding chi-square percentiles for $p = 2$ and $n = 10$ are listed in the following table:
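How such a table is formed, sketched with assumed (hypothetical) distances rather than the example's own values:

```python
# Sketch: pair ordered squared distances with chi-square percentiles
# q_{c,2}((j - 1/2)/n) -- the ingredients of a chi-square plot.
import numpy as np
from scipy.stats import chi2

d2 = np.sort([0.30, 0.62, 1.16, 1.30, 1.61,
              2.32, 2.88, 3.45, 4.34, 5.75])   # assumed distances, n = 10
n, p = len(d2), 2
q = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
for qj, dj in zip(q, d2):
    print(f"{qj:6.3f}  {dj:5.2f}")   # plot (qj, dj); near a line under normality
```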
Example 3.14 (Evaluating multivariate normality for a four-variable data set) The data in Table 4.3 were obtained by taking four different measures of stiffness, $x_1, x_2, x_3$, and $x_4$, of each of $n = 30$ boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances $d_j^2 = (x_j - \bar x)'S^{-1}(x_j - \bar x)$ are also presented in the table.
3.7 Detecting Outliers and Cleaning Data
• For a single random variable, the problem is one-dimensional, and we look for observations that are far from the others.
• In the bivariate case, the situation is more complicated. Figure 4.10 shows a
situation with two unusual observations.
Steps for Detecting Outliers
1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values $z_{jk} = (x_{jk} - \bar x_k)/\sqrt{s_{kk}}$ for $j = 1, 2, \ldots, n$ and each column $k = 1, 2, \ldots, p$. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances $(x_j - \bar x)'S^{-1}(x_j - \bar x)$. Examine these distances for unusually large values; in a chi-square plot, these would be the points farthest from the origin. (Steps 3 and 4 are sketched in code below.)
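Steps 3 and 4 in code, for an assumed data matrix X:

```python
# Sketch: standardized values and generalized squared distances.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(4), np.eye(4), size=30)  # assumed data

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
Z = (X - xbar) / np.sqrt(np.diag(S))      # z_jk = (x_jk - xbar_k) / sqrt(s_kk)
diff = X - xbar
d2 = np.einsum('ij,ij->i', diff, np.linalg.solve(S, diff.T).T)

print(np.argwhere(np.abs(Z) > 3.0))       # flags unusually large standardized values
print(np.argsort(d2)[-3:])                # indices of the 3 largest distances
```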
Example 3.15 (Detecting outliers in the data on lumber) Table 4.4 contains the data in Table 4.3, along with the standardized observations. These data consist of four different measurements of stiffness $x_1, x_2, x_3$, and $x_4$ on each of $n = 30$ boards. Detect outliers in these data.
3.8 Transformations to Near Normality
If normality is not a viable assumption, what is the next step?
• Ignore the findings of a normality check and proceed as if the data were normally distributed. (Not recommended.)
• Make nonnormal data more nearly normal by transforming them. Appropriate transformations are suggested by
1. theoretical considerations, or
2. the data themselves.
• Helpful Transformations to Near Normality

Original Scale: Transformed Scale
1. Counts, $y$: $\sqrt{y}$
2. Proportions, $\hat p$: $\operatorname{logit}(\hat p) = \frac{1}{2}\log\frac{\hat p}{1-\hat p}$
3. Correlations, $r$: Fisher's $z(r) = \frac{1}{2}\log\frac{1+r}{1-r}$
Given the observations $x_1, x_2, \ldots, x_n$, the Box-Cox choice of an appropriate power $\lambda$ is the value that maximizes the expression
$$ \ell(\lambda) = -\frac{n}{2}\ln\Big[\frac{1}{n}\sum_{j=1}^{n} \big(x_j^{(\lambda)} - \overline{x^{(\lambda)}}\big)^2\Big] + (\lambda - 1)\sum_{j=1}^{n} \ln x_j $$
where
$$ \overline{x^{(\lambda)}} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(\lambda)}, \qquad x_j^{(\lambda)} = \frac{x_j^{\lambda} - 1}{\lambda}\ (\lambda \ne 0), \qquad x_j^{(\lambda)} = \ln x_j\ (\lambda = 0). $$
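A minimal sketch that maximizes $\ell(\lambda)$ over a grid for assumed positive readings (illustrative values, not the example's table), cross-checked against scipy's Box-Cox MLE:

```python
# Sketch: grid-maximize the Box-Cox criterion l(lambda) for positive data x.
import numpy as np
from scipy import stats

def box_cox(x, lam):
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1) / lam

def ell(x, lam):
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = box_cox(x, lam)
    # -(n/2) ln[(1/n) sum (y_j - ybar)^2] + (lambda - 1) sum ln x_j
    return -n / 2 * np.log(np.mean((y - y.mean())**2)) + (lam - 1) * np.log(x).sum()

x = np.array([0.15, 0.09, 0.18, 0.10, 0.05,
              0.12, 0.08, 0.05, 0.08, 0.10])   # assumed positive readings
grid = np.linspace(-1.0, 1.5, 251)
best = max(grid, key=lambda lam: ell(x, lam))
print(best)
print(stats.boxcox(x)[1])   # scipy's MLE of lambda, for comparison
```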
Example 3.16 (Determining a power transformation for univariate data) We gave readings of the microwave radiation emitted through the closed doors of $n = 42$ ovens in Example 3.10. The Q-Q plot of these data in Figure 4.6 indicates that the observations deviate from what would be expected if they were normally distributed. Since all the observations are positive, let us perform a power transformation of the data which, we hope, will produce results that are more nearly normal. We must find the value of $\lambda$ that maximizes the function $\ell(\lambda)$.
Transforming Multivariate Observations
• With multivariate observations, a power transformation must be selected for each variable; a starting point is to apply the univariate procedure to each marginal distribution. If the resulting observations are not approximately jointly normal, the values $\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_p$ obtained from the preceding transformations can be used as a starting point, iterating toward the set of values $\lambda' = [\lambda_1, \lambda_2, \ldots, \lambda_p]$, which collectively maximizes
$$ \ell(\lambda_1, \lambda_2, \ldots, \lambda_p) = -\frac{n}{2}\ln|S(\lambda)| + (\lambda_1 - 1)\sum_{j=1}^{n}\ln x_{j1} + (\lambda_2 - 1)\sum_{j=1}^{n}\ln x_{j2} + \cdots + (\lambda_p - 1)\sum_{j=1}^{n}\ln x_{jp} $$
where $S(\lambda)$ is the sample covariance matrix computed from the transformed observations.
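A sketch that maximizes this criterion jointly over $(\lambda_1, \ldots, \lambda_p)$ with a general-purpose optimizer, for assumed positive data; $S(\lambda)$ is computed with divisor $n$:

```python
# Sketch: jointly choose (lambda_1, ..., lambda_p) by maximizing l(lambda).
import numpy as np
from scipy.optimize import minimize

def box_cox(x, lam):
    return np.log(x) if abs(lam) < 1e-12 else (x**lam - 1) / lam

def neg_ell(lams, X):
    n, p = X.shape
    Y = np.column_stack([box_cox(X[:, k], lams[k]) for k in range(p)])
    S_lam = np.cov(Y, rowvar=False, bias=True)          # divisor n
    _, logdet = np.linalg.slogdet(S_lam)
    val = -n / 2 * logdet
    val += sum((lams[k] - 1) * np.log(X[:, k]).sum() for k in range(p))
    return -val                                         # minimize the negative

rng = np.random.default_rng(4)
X = rng.lognormal(size=(50, 3))                         # assumed positive data
res = minimize(neg_ell, x0=np.ones(3), args=(X,), method='Nelder-Mead')
print(res.x)                                            # lambda-hat vector
```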
If the data include some large negative values and have a single long tail, a more general transformation should be applied:
$$ x^{(\lambda)} = \begin{cases} \{(x+1)^{\lambda} - 1\}/\lambda & x \ge 0,\ \lambda \ne 0 \\ \ln(x+1) & x \ge 0,\ \lambda = 0 \\ -\{(-x+1)^{2-\lambda} - 1\}/(2-\lambda) & x < 0,\ \lambda \ne 2 \\ -\ln(-x+1) & x < 0,\ \lambda = 2 \end{cases} $$
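This piecewise family coincides with the Yeo-Johnson transformation, which scipy implements; a direct implementation compared against it, for assumed data containing negatives:

```python
# Sketch: the piecewise transformation above; it matches scipy.stats.yeojohnson.
import numpy as np
from scipy.stats import yeojohnson

def transform(x, lam):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lam != 0:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log(x[pos] + 1)
    if lam != 2:
        out[~pos] = -((-x[~pos] + 1) ** (2 - lam) - 1) / (2 - lam)
    else:
        out[~pos] = -np.log(-x[~pos] + 1)
    return out

x = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # assumed data with negative values
print(transform(x, 0.5))
print(yeojohnson(x, lmbda=0.5))             # should match
```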